Statistical Learning Project - Player position classification

1. Abstract and motivation

FIFA is one of the best-known video game series and the most famous sports title in the industry; here we consider the FIFA 22 edition. Each player occupies a specific position on the field. Our goal is to build models that classify a player's position from the values of his attributes. Some players share features with footballers who play in other positions, and this may affect our task. For example, some attacking midfielders (CAM) have good shooting and pace, just like wingers (RW, LW). We will take this into account and adjust our classification accordingly.

2. The dataset - Description & EDA

The original dataset has been extracted from https://sofifa.com/ and contains 19239 players described by 110 different features.

2.1 DataFrame inspection and rough slicing

We set the seed so that the experiments are reproducible.

set.seed(123)
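
As a minimal sketch of what the seed buys us (the variable names below are illustrative and not part of the project code), re-seeding the generator before a random draw yields the same draw again:

```r
# Re-seeding restores the random number generator state,
# so the same call produces the same result.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE
```

Note that the seed only guarantees identical results when the same random calls are executed in the same order after it.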

First we load the dataset and check its dimensions.

players_full <- read.csv("E:/horatiu/Documents/players_22.csv") #full dataframe
dim(players_full) #full dataset
## [1] 19239   110

We have 19239 players (roughly 20k), each described by 110 attributes. Below we look at how those attributes are named.

colnames(players_full)
##   [1] "sofifa_id"                   "player_url"                 
##   [3] "short_name"                  "long_name"                  
##   [5] "player_positions"            "overall"                    
##   [7] "potential"                   "value_eur"                  
##   [9] "wage_eur"                    "age"                        
##  [11] "dob"                         "height_cm"                  
##  [13] "weight_kg"                   "club_team_id"               
##  [15] "club_name"                   "league_name"                
##  [17] "league_level"                "club_position"              
##  [19] "club_jersey_number"          "club_loaned_from"           
##  [21] "club_joined"                 "club_contract_valid_until"  
##  [23] "nationality_id"              "nationality_name"           
##  [25] "nation_team_id"              "nation_position"            
##  [27] "nation_jersey_number"        "preferred_foot"             
##  [29] "weak_foot"                   "skill_moves"                
##  [31] "international_reputation"    "work_rate"                  
##  [33] "body_type"                   "real_face"                  
##  [35] "release_clause_eur"          "player_tags"                
##  [37] "player_traits"               "pace"                       
##  [39] "shooting"                    "passing"                    
##  [41] "dribbling"                   "defending"                  
##  [43] "physic"                      "attacking_crossing"         
##  [45] "attacking_finishing"         "attacking_heading_accuracy" 
##  [47] "attacking_short_passing"     "attacking_volleys"          
##  [49] "skill_dribbling"             "skill_curve"                
##  [51] "skill_fk_accuracy"           "skill_long_passing"         
##  [53] "skill_ball_control"          "movement_acceleration"      
##  [55] "movement_sprint_speed"       "movement_agility"           
##  [57] "movement_reactions"          "movement_balance"           
##  [59] "power_shot_power"            "power_jumping"              
##  [61] "power_stamina"               "power_strength"             
##  [63] "power_long_shots"            "mentality_aggression"       
##  [65] "mentality_interceptions"     "mentality_positioning"      
##  [67] "mentality_vision"            "mentality_penalties"        
##  [69] "mentality_composure"         "defending_marking_awareness"
##  [71] "defending_standing_tackle"   "defending_sliding_tackle"   
##  [73] "goalkeeping_diving"          "goalkeeping_handling"       
##  [75] "goalkeeping_kicking"         "goalkeeping_positioning"    
##  [77] "goalkeeping_reflexes"        "goalkeeping_speed"          
##  [79] "ls"                          "st"                         
##  [81] "rs"                          "lw"                         
##  [83] "lf"                          "cf"                         
##  [85] "rf"                          "rw"                         
##  [87] "lam"                         "cam"                        
##  [89] "ram"                         "lm"                         
##  [91] "lcm"                         "cm"                         
##  [93] "rcm"                         "rm"                         
##  [95] "lwb"                         "ldm"                        
##  [97] "cdm"                         "rdm"                        
##  [99] "rwb"                         "lb"                         
## [101] "lcb"                         "cb"                         
## [103] "rcb"                         "rb"                         
## [105] "gk"                          "player_face_url"            
## [107] "club_logo_url"               "club_flag_url"              
## [109] "nation_logo_url"             "nation_flag_url"

To get a better general idea, we also look at the kind of data they contain.

head(players_full, 10)
##    sofifa_id                                                         player_url
## 1     158023               https://sofifa.com/player/158023/lionel-messi/220002
## 2     188545         https://sofifa.com/player/188545/robert-lewandowski/220002
## 3      20801 https://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/220002
## 4     190871  https://sofifa.com/player/190871/neymar-da-silva-santos-jr/220002
## 5     192985            https://sofifa.com/player/192985/kevin-de-bruyne/220002
## 6     200389                  https://sofifa.com/player/200389/jan-oblak/220002
## 7     231747              https://sofifa.com/player/231747/kylian-mbappe/220002
## 8     167495               https://sofifa.com/player/167495/manuel-neuer/220002
## 9     192448      https://sofifa.com/player/192448/marc-andre-ter-stegen/220002
## 10    202126                 https://sofifa.com/player/202126/harry-kane/220002
##           short_name                           long_name player_positions
## 1           L. Messi     Lionel Andrés Messi Cuccittini       RW, ST, CF
## 2     R. Lewandowski                  Robert Lewandowski               ST
## 3  Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro           ST, LW
## 4          Neymar Jr      Neymar da Silva Santos Júnior          LW, CAM
## 5       K. De Bruyne                     Kevin De Bruyne          CM, CAM
## 6           J. Oblak                           Jan Oblak               GK
## 7         K. Mbappé               Kylian Mbappé Lottin           ST, LW
## 8           M. Neuer                  Manuel Peter Neuer               GK
## 9      M. ter Stegen              Marc-André ter Stegen               GK
## 10           H. Kane                          Harry Kane               ST
##    overall potential value_eur wage_eur age        dob height_cm weight_kg
## 1       93        93  78000000   320000  34 1987-06-24       170        72
## 2       92        92 119500000   270000  32 1988-08-21       185        81
## 3       91        91  45000000   270000  36 1985-02-05       187        83
## 4       91        91 129000000   270000  29 1992-02-05       175        68
## 5       91        91 125500000   350000  30 1991-06-28       181        70
## 6       91        93 112000000   130000  28 1993-01-07       188        87
## 7       91        95 194000000   230000  22 1998-12-20       182        73
## 8       90        90  13500000    86000  35 1986-03-27       193        93
## 9       90        92  99000000   250000  29 1992-04-30       187        85
## 10      90        90 129500000   240000  27 1993-07-28       188        89
##    club_team_id           club_name            league_name league_level
## 1            73 Paris Saint-Germain         French Ligue 1            1
## 2            21  FC Bayern München   German 1. Bundesliga            1
## 3            11   Manchester United English Premier League            1
## 4            73 Paris Saint-Germain         French Ligue 1            1
## 5            10     Manchester City English Premier League            1
## 6           240 Atlético de Madrid Spain Primera Division            1
## 7            73 Paris Saint-Germain         French Ligue 1            1
## 8            21  FC Bayern München   German 1. Bundesliga            1
## 9           241        FC Barcelona Spain Primera Division            1
## 10           18   Tottenham Hotspur English Premier League            1
##    club_position club_jersey_number club_loaned_from club_joined
## 1             RW                 30                   2021-08-10
## 2             ST                  9                   2014-07-01
## 3             ST                  7                   2021-08-27
## 4             LW                 10                   2017-08-03
## 5            RCM                 17                   2015-08-30
## 6             GK                 13                   2014-07-16
## 7             ST                  7                   2018-07-01
## 8             GK                  1                   2011-07-01
## 9             GK                  1                   2014-07-01
## 10            ST                 10                   2010-07-28
##    club_contract_valid_until nationality_id nationality_name nation_team_id
## 1                       2023             52        Argentina           1369
## 2                       2023             37           Poland           1353
## 3                       2023             38         Portugal           1354
## 4                       2025             54           Brazil             NA
## 5                       2025              7          Belgium           1325
## 6                       2023             44         Slovenia             NA
## 7                       2022             18           France           1335
## 8                       2023             21          Germany           1337
## 9                       2025             21          Germany             NA
## 10                      2024             14          England           1318
##    nation_position nation_jersey_number preferred_foot weak_foot skill_moves
## 1               RW                   10           Left         4           4
## 2               RS                    9          Right         4           4
## 3               ST                    7          Right         4           5
## 4                                    NA          Right         5           5
## 5              RCM                    7          Right         5           4
## 6                                    NA          Right         3           1
## 7               LW                   10          Right         4           5
## 8               GK                    1          Right         4           1
## 9                                    NA          Right         4           1
## 10              ST                    9          Right         5           3
##    international_reputation     work_rate body_type real_face
## 1                         5    Medium/Low    Unique       Yes
## 2                         5   High/Medium    Unique       Yes
## 3                         5      High/Low    Unique       Yes
## 4                         5   High/Medium    Unique       Yes
## 5                         4     High/High    Unique       Yes
## 6                         5 Medium/Medium    Unique       Yes
## 7                         4      High/Low    Unique       Yes
## 8                         5 Medium/Medium    Unique       Yes
## 9                         4 Medium/Medium    Unique       Yes
## 10                        4     High/High    Unique       Yes
##    release_clause_eur
## 1           144300000
## 2           197200000
## 3            83300000
## 4           238700000
## 5           232200000
## 6           238000000
## 7           373500000
## 8            22300000
## 9           210400000
## 10          246100000
##                                                                                                player_tags
## 1            #Dribbler, #Distance Shooter, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Forward
## 2                                 #Aerial Threat, #Distance Shooter, #Clinical Finisher, #Complete Forward
## 3  #Aerial Threat, #Dribbler, #Distance Shooter, #Crosser, #Acrobat, #Clinical Finisher, #Complete Forward
## 4                        #Speedster, #Dribbler, #Playmaker, #FK Specialist, #Acrobat, #Complete Midfielder
## 5                        #Dribbler, #Playmaker, #Engine, #Distance Shooter, #Crosser, #Complete Midfielder
## 6                                                                                                         
## 7                                   #Speedster, #Dribbler, #Acrobat, #Clinical Finisher, #Complete Forward
## 8                                                                                                         
## 9                                                                                                         
## 10                                                                   #Distance Shooter, #Clinical Finisher
##                                                                                                                      player_traits
## 1  Finesse Shot, Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot, One Club Player, Chip Shot (AI), Technical Dribbler (AI)
## 2                                                                    Solid Player, Finesse Shot, Outside Foot Shot, Chip Shot (AI)
## 3                                             Power Free-Kick, Flair, Long Shot Taker (AI), Speed Dribbler (AI), Outside Foot Shot
## 4                             Injury Prone, Flair, Speed Dribbler (AI), Playmaker (AI), Outside Foot Shot, Technical Dribbler (AI)
## 5               Injury Prone, Leadership, Early Crosser, Long Passer (AI), Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot
## 6                                                                                                 GK Long Throw, Comes For Crosses
## 7                                                           Flair, Speed Dribbler (AI), Outside Foot Shot, Technical Dribbler (AI)
## 8                                                                 Leadership, GK Long Throw, Rushes Out Of Goal, Comes For Crosses
## 9                                                                           Rushes Out Of Goal, Comes For Crosses, Saves with Feet
## 10                                           Leadership, Long Passer (AI), Long Shot Taker (AI), Playmaker (AI), Outside Foot Shot
##    pace shooting passing dribbling defending physic attacking_crossing
## 1    85       92      91        95        34     65                 85
## 2    78       92      79        86        44     82                 71
## 3    87       94      80        88        34     75                 87
## 4    91       83      86        94        37     63                 85
## 5    76       86      93        88        64     78                 94
## 6    NA       NA      NA        NA        NA     NA                 13
## 7    97       88      80        92        36     77                 78
## 8    NA       NA      NA        NA        NA     NA                 15
## 9    NA       NA      NA        NA        NA     NA                 18
## 10   70       91      83        83        47     83                 80
##    attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1                   95                         70                      91
## 2                   95                         90                      85
## 3                   95                         90                      80
## 4                   83                         63                      86
## 5                   82                         55                      94
## 6                   11                         15                      43
## 7                   93                         72                      85
## 8                   13                         25                      60
## 9                   14                         11                      61
## 10                  94                         86                      85
##    attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1                 88              96          93                94
## 2                 89              85          79                85
## 3                 86              88          81                84
## 4                 86              95          88                87
## 5                 82              88          85                83
## 6                 13              12          13                14
## 7                 83              93          80                69
## 8                 11              30          14                11
## 9                 14              21          18                12
## 10                88              83          83                65
##    skill_long_passing skill_ball_control movement_acceleration
## 1                  91                 96                    91
## 2                  70                 88                    77
## 3                  77                 88                    85
## 4                  81                 95                    93
## 5                  93                 91                    76
## 6                  40                 30                    43
## 7                  71                 91                    97
## 8                  68                 46                    54
## 9                  63                 30                    38
## 10                 86                 85                    65
##    movement_sprint_speed movement_agility movement_reactions movement_balance
## 1                     80               91                 94               95
## 2                     79               77                 93               82
## 3                     88               86                 94               74
## 4                     89               96                 89               84
## 5                     76               79                 91               78
## 6                     60               67                 88               49
## 7                     97               92                 93               83
## 8                     60               51                 87               35
## 9                     50               39                 86               43
## 10                    74               71                 92               70
##    power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1                86            68            72             69               94
## 2                90            85            76             86               87
## 3                94            95            77             77               93
## 4                80            64            81             53               81
## 5                91            63            89             74               91
## 6                59            78            41             78               12
## 7                86            78            88             77               82
## 8                68            77            43             80               16
## 9                66            79            35             78               10
## 10               91            79            83             85               86
##    mentality_aggression mentality_interceptions mentality_positioning
## 1                    44                      40                    93
## 2                    81                      49                    95
## 3                    63                      29                    95
## 4                    63                      37                    86
## 5                    76                      66                    88
## 6                    34                      19                    11
## 7                    62                      38                    92
## 8                    29                      30                    12
## 9                    43                      22                    11
## 10                   80                      44                    94
##    mentality_vision mentality_penalties mentality_composure
## 1                95                  75                  96
## 2                81                  90                  88
## 3                76                  88                  95
## 4                90                  93                  93
## 5                94                  83                  89
## 6                65                  11                  68
## 7                82                  79                  88
## 8                70                  47                  70
## 9                70                  25                  70
## 10               87                  91                  91
##    defending_marking_awareness defending_standing_tackle
## 1                           20                        35
## 2                           35                        42
## 3                           24                        32
## 4                           35                        32
## 5                           68                        65
## 6                           27                        12
## 7                           26                        34
## 8                           17                        10
## 9                           25                        13
## 10                          50                        36
##    defending_sliding_tackle goalkeeping_diving goalkeeping_handling
## 1                        24                  6                   11
## 2                        19                 15                    6
## 3                        24                  7                   11
## 4                        29                  9                    9
## 5                        53                 15                   13
## 6                        18                 87                   92
## 7                        32                 13                    5
## 8                        11                 88                   88
## 9                        10                 88                   85
## 10                       38                  8                   10
##    goalkeeping_kicking goalkeeping_positioning goalkeeping_reflexes
## 1                   15                      14                    8
## 2                   12                       8                   10
## 3                   15                      14                   11
## 4                   15                      15                   11
## 5                    5                      10                   13
## 6                   78                      90                   90
## 7                    7                      11                    6
## 8                   91                      89                   88
## 9                   88                      88                   90
## 10                  11                      14                   11
##    goalkeeping_speed   ls   st   rs lw lf cf rf rw  lam  cam  ram   lm  lcm
## 1                 NA 89+3 89+3 89+3 92 93 93 93 92   93   93   93 91+2 87+3
## 2                 NA 90+2 90+2 90+2 85 88 88 88 85 86+3 86+3 86+3 84+3 80+3
## 3                 NA 90+1 90+1 90+1 88 89 89 89 88 86+3 86+3 86+3 86+3 78+3
## 4                 NA 83+3 83+3 83+3 90 88 88 88 90 89+2 89+2 89+2 89+2 82+3
## 5                 NA 83+3 83+3 83+3 88 87 87 87 88 89+2 89+2 89+2 89+2 89+2
## 6                 50 33+3 33+3 33+3 32 35 35 35 32 38+3 38+3 38+3 35+3 38+3
## 7                 NA 89+3 89+3 89+3 90 90 90 90 90 89+3 89+3 89+3 89+3 81+3
## 8                 56 40+3 40+3 40+3 40 43 43 43 40 47+3 47+3 47+3 44+3 50+3
## 9                 43 35+3 35+3 35+3 35 38 38 38 35 42+3 42+3 42+3 39+3 45+3
## 10                NA 88+2 88+2 88+2 84 86 86 86 84 85+3 85+3 85+3 84+3 82+3
##      cm  rcm   rm  lwb  ldm  cdm  rdm  rwb   lb  lcb   cb  rcb   rb   gk
## 1  87+3 87+3 91+2 66+3 64+3 64+3 64+3 66+3 61+3 50+3 50+3 50+3 61+3 19+3
## 2  80+3 80+3 84+3 64+3 66+3 66+3 66+3 64+3 61+3 60+3 60+3 60+3 61+3 19+3
## 3  78+3 78+3 86+3 63+3 59+3 59+3 59+3 63+3 60+3 53+3 53+3 53+3 60+3 20+3
## 4  82+3 82+3 89+2 67+3 63+3 63+3 63+3 67+3 62+3 50+3 50+3 50+3 62+3 20+3
## 5  89+2 89+2 89+2 79+3 80+3 80+3 80+3 79+3 75+3 69+3 69+3 69+3 75+3 21+3
## 6  38+3 38+3 35+3 32+3 36+3 36+3 36+3 32+3 32+3 33+3 33+3 33+3 32+3 89+3
## 7  81+3 81+3 89+3 67+3 63+3 63+3 63+3 67+3 63+3 54+3 54+3 54+3 63+3 18+3
## 8  50+3 50+3 44+3 37+3 43+3 43+3 43+3 37+3 35+3 34+3 34+3 34+3 35+3 88+2
## 9  45+3 45+3 39+3 33+3 41+3 41+3 41+3 33+3 31+3 33+3 33+3 33+3 31+3 88+3
## 10 82+3 82+3 84+3 67+3 68+3 68+3 68+3 67+3 64+3 61+3 61+3 61+3 64+3 20+3
##                                      player_face_url
## 1  https://cdn.sofifa.net/players/158/023/22_120.png
## 2  https://cdn.sofifa.net/players/188/545/22_120.png
## 3  https://cdn.sofifa.net/players/020/801/22_120.png
## 4  https://cdn.sofifa.net/players/190/871/22_120.png
## 5  https://cdn.sofifa.net/players/192/985/22_120.png
## 6  https://cdn.sofifa.net/players/200/389/22_120.png
## 7  https://cdn.sofifa.net/players/231/747/22_120.png
## 8  https://cdn.sofifa.net/players/167/495/22_120.png
## 9  https://cdn.sofifa.net/players/192/448/22_120.png
## 10 https://cdn.sofifa.net/players/202/126/22_120.png
##                              club_logo_url
## 1   https://cdn.sofifa.net/teams/73/60.png
## 2   https://cdn.sofifa.net/teams/21/60.png
## 3   https://cdn.sofifa.net/teams/11/60.png
## 4   https://cdn.sofifa.net/teams/73/60.png
## 5   https://cdn.sofifa.net/teams/10/60.png
## 6  https://cdn.sofifa.net/teams/240/60.png
## 7   https://cdn.sofifa.net/teams/73/60.png
## 8   https://cdn.sofifa.net/teams/21/60.png
## 9  https://cdn.sofifa.net/teams/241/60.png
## 10  https://cdn.sofifa.net/teams/18/60.png
##                              club_flag_url
## 1      https://cdn.sofifa.net/flags/fr.png
## 2      https://cdn.sofifa.net/flags/de.png
## 3  https://cdn.sofifa.net/flags/gb-eng.png
## 4      https://cdn.sofifa.net/flags/fr.png
## 5  https://cdn.sofifa.net/flags/gb-eng.png
## 6      https://cdn.sofifa.net/flags/es.png
## 7      https://cdn.sofifa.net/flags/fr.png
## 8      https://cdn.sofifa.net/flags/de.png
## 9      https://cdn.sofifa.net/flags/es.png
## 10 https://cdn.sofifa.net/flags/gb-eng.png
##                             nation_logo_url
## 1  https://cdn.sofifa.net/teams/1369/60.png
## 2  https://cdn.sofifa.net/teams/1353/60.png
## 3  https://cdn.sofifa.net/teams/1354/60.png
## 4                                          
## 5  https://cdn.sofifa.net/teams/1325/60.png
## 6                                          
## 7  https://cdn.sofifa.net/teams/1335/60.png
## 8  https://cdn.sofifa.net/teams/1337/60.png
## 9                                          
## 10 https://cdn.sofifa.net/teams/1318/60.png
##                            nation_flag_url
## 1      https://cdn.sofifa.net/flags/ar.png
## 2      https://cdn.sofifa.net/flags/pl.png
## 3      https://cdn.sofifa.net/flags/pt.png
## 4      https://cdn.sofifa.net/flags/br.png
## 5      https://cdn.sofifa.net/flags/be.png
## 6      https://cdn.sofifa.net/flags/si.png
## 7      https://cdn.sofifa.net/flags/fr.png
## 8      https://cdn.sofifa.net/flags/de.png
## 9      https://cdn.sofifa.net/flags/de.png
## 10 https://cdn.sofifa.net/flags/gb-eng.png
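
The printout above is wide, so a compact summary of column types and missing values can complement it. The sketch below uses a small toy data frame (hypothetical values, not taken from the dataset) that mimics three patterns visible in the output: numeric ratings, ratings stored as strings such as "89+3", and NA values in outfield attributes for goalkeepers. The same two calls could be applied to players_full.

```r
# Toy stand-in for players_full (hypothetical values, same kinds of columns).
toy <- data.frame(
  short_name = c("A. Player", "B. Player", "C. Keeper"),
  overall    = c(80L, 75L, 78L),
  pace       = c(85, 70, NA),            # outfield-only rating, NA for the keeper
  st         = c("80+2", "60+3", "30+3"), # positional rating stored as text
  stringsAsFactors = FALSE
)

# Count columns by storage class.
type_counts <- table(sapply(toy, class))
print(type_counts)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```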

We perform a rough removal of the features that are clearly irrelevant to our classification, along with some that are an obvious linear combination of other features. Moreover, training will be performed only on players whose league_level is 1. Then we check the dimensions again.

players_full <- players_full[players_full$league_level == 1,]

players_22 <- subset(players_full, select = c("short_name","player_positions","age","height_cm","weight_kg","pace","shooting","passing","preferred_foot","weak_foot","dribbling","defending","physic","attacking_crossing","attacking_finishing","attacking_heading_accuracy","attacking_short_passing","attacking_volleys","skill_dribbling","skill_curve","skill_fk_accuracy","skill_long_passing","skill_ball_control","movement_acceleration","movement_sprint_speed","movement_agility","movement_reactions","movement_balance","power_shot_power","power_jumping","power_stamina","power_strength","power_long_shots","mentality_aggression","mentality_interceptions","mentality_positioning","mentality_vision","mentality_penalties","mentality_composure","defending_marking_awareness","defending_standing_tackle","defending_sliding_tackle"))


dim(players_22)
## [1] 14918    42
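
One caveat about the row filter: with logical indexing, every NA in the condition produces a row filled with NA. Whether league_level actually contains missing values in this dataset is not verified here, so the following is only a sketch on toy data (hypothetical values) showing the difference and an NA-safe alternative based on which():

```r
# Toy data: one player has a missing league_level.
toy <- data.frame(
  short_name   = c("A", "B", "C", "D"),
  league_level = c(1, 2, NA, 1)
)

# Logical indexing keeps an all-NA row for every NA in the condition.
by_index <- toy[toy$league_level == 1, ]
nrow(by_index)   # 3: two matches plus one all-NA row

# which() keeps only the positions where the condition is TRUE.
by_which <- toy[which(toy$league_level == 1), ]
nrow(by_which)   # 2
```

If the two row counts differ on the real data, the surplus rows are all-NA placeholders rather than real players.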

We kept only 42 of the 110 features, which is good enough for now. We will remove more later through feature selection, so stay tuned.
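
The "linear combination" criterion mentioned earlier can also be checked numerically instead of by eye. The sketch below is an illustration on synthetic data, not the procedure used in this project: the weights 0.45 and 0.55 and all variable names are hypothetical. It regresses a composite score on its components and looks at the R-squared; a value close to 1 indicates that the composite carries almost no information beyond its components.

```r
set.seed(1)

# Synthetic component ratings (hypothetical, for illustration only).
acceleration <- round(runif(50, 40, 95))
sprint_speed <- round(runif(50, 40, 95))

# Hypothetical composite: a rounded weighted average of the two components.
composite <- round(0.45 * acceleration + 0.55 * sprint_speed)

# Regress the composite on its components and extract R-squared.
fit <- lm(composite ~ acceleration + sprint_speed)
r2 <- summary(fit)$r.squared
print(r2)
```

On real columns the same regression could be run with one aggregate rating as the response and its suspected components as predictors.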

head(players_22, n=5)
##          short_name player_positions age height_cm weight_kg pace shooting
## 1          L. Messi       RW, ST, CF  34       170        72   85       92
## 2    R. Lewandowski               ST  32       185        81   78       92
## 3 Cristiano Ronaldo           ST, LW  36       187        83   87       94
## 4         Neymar Jr          LW, CAM  29       175        68   91       83
## 5      K. De Bruyne          CM, CAM  30       181        70   76       86
##   passing preferred_foot weak_foot dribbling defending physic
## 1      91           Left         4        95        34     65
## 2      79          Right         4        86        44     82
## 3      80          Right         4        88        34     75
## 4      86          Right         5        94        37     63
## 5      93          Right         5        88        64     78
##   attacking_crossing attacking_finishing attacking_heading_accuracy
## 1                 85                  95                         70
## 2                 71                  95                         90
## 3                 87                  95                         90
## 4                 85                  83                         63
## 5                 94                  82                         55
##   attacking_short_passing attacking_volleys skill_dribbling skill_curve
## 1                      91                88              96          93
## 2                      85                89              85          79
## 3                      80                86              88          81
## 4                      86                86              95          88
## 5                      94                82              88          85
##   skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## 1                94                 91                 96                    91
## 2                85                 70                 88                    77
## 3                84                 77                 88                    85
## 4                87                 81                 95                    93
## 5                83                 93                 91                    76
##   movement_sprint_speed movement_agility movement_reactions movement_balance
## 1                    80               91                 94               95
## 2                    79               77                 93               82
## 3                    88               86                 94               74
## 4                    89               96                 89               84
## 5                    76               79                 91               78
##   power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1               86            68            72             69               94
## 2               90            85            76             86               87
## 3               94            95            77             77               93
## 4               80            64            81             53               81
## 5               91            63            89             74               91
##   mentality_aggression mentality_interceptions mentality_positioning
## 1                   44                      40                    93
## 2                   81                      49                    95
## 3                   63                      29                    95
## 4                   63                      37                    86
## 5                   76                      66                    88
##   mentality_vision mentality_penalties mentality_composure
## 1               95                  75                  96
## 2               81                  90                  88
## 3               76                  88                  95
## 4               90                  93                  93
## 5               94                  83                  89
##   defending_marking_awareness defending_standing_tackle
## 1                          20                        35
## 2                          35                        42
## 3                          24                        32
## 4                          35                        32
## 5                          68                        65
##   defending_sliding_tackle
## 1                       24
## 2                       19
## 3                       24
## 4                       29
## 5                       53

We take a quick look at a numerical summary of all the features we selected. At first glance they look like they need some normalization. Before that, however, we would like to produce some visualizations.

summary(players_22)
##   short_name        player_positions        age          height_cm  
##  Length:14918       Length:14918       Min.   :16.00   Min.   :155  
##  Class :character   Class :character   1st Qu.:21.00   1st Qu.:176  
##  Mode  :character   Mode  :character   Median :25.00   Median :181  
##                                        Mean   :25.34   Mean   :181  
##                                        3rd Qu.:29.00   3rd Qu.:186  
##                                        Max.   :54.00   Max.   :203  
##                                        NA's   :61      NA's   :61   
##    weight_kg           pace          shooting       passing     
##  Min.   : 49.00   Min.   :28.00   Min.   :18.0   Min.   :25.00  
##  1st Qu.: 70.00   1st Qu.:62.00   1st Qu.:42.0   1st Qu.:51.00  
##  Median : 75.00   Median :69.00   Median :55.0   Median :58.00  
##  Mean   : 74.84   Mean   :68.33   Mean   :52.8   Mean   :57.88  
##  3rd Qu.: 80.00   3rd Qu.:76.00   3rd Qu.:64.0   3rd Qu.:65.00  
##  Max.   :107.00   Max.   :97.00   Max.   :94.0   Max.   :93.00  
##  NA's   :61       NA's   :1725    NA's   :1725   NA's   :1725   
##  preferred_foot       weak_foot       dribbling       defending    
##  Length:14918       Min.   :1.000   Min.   :27.00   Min.   :15.00  
##  Class :character   1st Qu.:3.000   1st Qu.:57.00   1st Qu.:38.00  
##  Mode  :character   Median :3.000   Median :64.00   Median :56.00  
##                     Mean   :2.948   Mean   :62.99   Mean   :52.03  
##                     3rd Qu.:3.000   3rd Qu.:70.00   3rd Qu.:65.00  
##                     Max.   :5.000   Max.   :95.00   Max.   :91.00  
##                     NA's   :61      NA's   :1725    NA's   :1725   
##      physic      attacking_crossing attacking_finishing
##  Min.   :29.00   Min.   : 6         Min.   : 2.0       
##  1st Qu.:59.00   1st Qu.:39         1st Qu.:31.0       
##  Median :66.00   Median :54         Median :50.0       
##  Mean   :64.89   Mean   :50         Mean   :46.2       
##  3rd Qu.:72.00   3rd Qu.:64         3rd Qu.:62.0       
##  Max.   :90.00   Max.   :94         Max.   :95.0       
##  NA's   :1725    NA's   :61         NA's   :61         
##  attacking_heading_accuracy attacking_short_passing attacking_volleys
##  Min.   : 5.00              Min.   : 7.00           Min.   : 3.00    
##  1st Qu.:44.00              1st Qu.:55.00           1st Qu.:30.00    
##  Median :55.00              Median :63.00           Median :44.00    
##  Mean   :51.95              Mean   :59.33           Mean   :42.89    
##  3rd Qu.:64.00              3rd Qu.:69.00           3rd Qu.:57.00    
##  Max.   :93.00              Max.   :94.00           Max.   :90.00    
##  NA's   :61                 NA's   :61              NA's   :61       
##  skill_dribbling  skill_curve    skill_fk_accuracy skill_long_passing
##  Min.   : 4      Min.   : 6.00   Min.   : 4.00     Min.   : 9.00     
##  1st Qu.:50      1st Qu.:35.00   1st Qu.:31.00     1st Qu.:45.00     
##  Median :62      Median :49.00   Median :41.00     Median :57.00     
##  Mean   :56      Mean   :47.73   Mean   :42.65     Mean   :53.63     
##  3rd Qu.:69      3rd Qu.:62.00   3rd Qu.:56.00     3rd Qu.:65.00     
##  Max.   :96      Max.   :94.00   Max.   :94.00     Max.   :93.00     
##  NA's   :61      NA's   :61      NA's   :61        NA's   :61        
##  skill_ball_control movement_acceleration movement_sprint_speed
##  Min.   : 8.00      Min.   :14.0          Min.   :15.00        
##  1st Qu.:55.00      1st Qu.:58.0          1st Qu.:58.00        
##  Median :63.00      Median :68.0          Median :68.00        
##  Mean   :58.88      Mean   :64.7          Mean   :64.77        
##  3rd Qu.:70.00      3rd Qu.:75.0          3rd Qu.:75.00        
##  Max.   :96.00      Max.   :97.0          Max.   :97.00        
##  NA's   :61         NA's   :61            NA's   :61           
##  movement_agility movement_reactions movement_balance power_shot_power
##  Min.   :18.00    Min.   :25.00      Min.   :19.0     Min.   :20.00   
##  1st Qu.:55.00    1st Qu.:56.00      1st Qu.:56.0     1st Qu.:48.00   
##  Median :66.00    Median :62.00      Median :66.0     Median :59.00   
##  Mean   :63.55    Mean   :61.91      Mean   :64.1     Mean   :58.19   
##  3rd Qu.:74.00    3rd Qu.:68.00      3rd Qu.:74.0     3rd Qu.:68.00   
##  Max.   :96.00    Max.   :94.00      Max.   :96.0     Max.   :95.00   
##  NA's   :61       NA's   :61         NA's   :61       NA's   :61      
##  power_jumping   power_stamina   power_strength  power_long_shots
##  Min.   :24.00   Min.   :12.00   Min.   :19.00   Min.   : 4.00   
##  1st Qu.:57.00   1st Qu.:56.00   1st Qu.:57.00   1st Qu.:32.00   
##  Median :65.00   Median :67.00   Median :66.00   Median :51.00   
##  Mean   :64.75   Mean   :63.15   Mean   :64.97   Mean   :47.08   
##  3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:74.00   3rd Qu.:63.00   
##  Max.   :95.00   Max.   :97.00   Max.   :96.00   Max.   :94.00   
##  NA's   :61      NA's   :61      NA's   :61      NA's   :61      
##  mentality_aggression mentality_interceptions mentality_positioning
##  Min.   :10.00        Min.   : 4.00           Min.   : 2.00        
##  1st Qu.:45.00        1st Qu.:26.00           1st Qu.:40.00        
##  Median :59.00        Median :53.00           Median :56.00        
##  Mean   :55.85        Mean   :46.95           Mean   :50.76        
##  3rd Qu.:69.00        3rd Qu.:64.00           3rd Qu.:65.00        
##  Max.   :95.00        Max.   :91.00           Max.   :96.00        
##  NA's   :61           NA's   :61              NA's   :61           
##  mentality_vision mentality_penalties mentality_composure
##  Min.   :10.00    Min.   : 7.00       Min.   :12.0       
##  1st Qu.:45.00    1st Qu.:38.00       1st Qu.:50.0       
##  Median :56.00    Median :49.00       Median :59.0       
##  Mean   :54.49    Mean   :48.11       Mean   :58.4       
##  3rd Qu.:65.00    3rd Qu.:60.00       3rd Qu.:67.0       
##  Max.   :95.00    Max.   :93.00       Max.   :96.0       
##  NA's   :61       NA's   :61          NA's   :61         
##  defending_marking_awareness defending_standing_tackle defending_sliding_tackle
##  Min.   : 4.00               Min.   : 5.00             Min.   : 5.00           
##  1st Qu.:29.00               1st Qu.:28.00             1st Qu.:26.00           
##  Median :52.00               Median :55.00             Median :53.00           
##  Mean   :46.86               Mean   :48.28             Mean   :46.12           
##  3rd Qu.:64.00               3rd Qu.:66.00             3rd Qu.:64.00           
##  Max.   :93.00               Max.   :93.00             Max.   :92.00           
##  NA's   :61                  NA's   :61                NA's   :61

2.2 Managing empty entries

We check which attributes contain NAs, in order to decide whether to remove or fill them. Note that the values printed below are column indices, not NA counts: every column contains at least one NA.

which(apply(X = players_22, MARGIN = 2, FUN = anyNA) == TRUE) # check for NA
##                  short_name            player_positions 
##                           1                           2 
##                         age                   height_cm 
##                           3                           4 
##                   weight_kg                        pace 
##                           5                           6 
##                    shooting                     passing 
##                           7                           8 
##              preferred_foot                   weak_foot 
##                           9                          10 
##                   dribbling                   defending 
##                          11                          12 
##                      physic          attacking_crossing 
##                          13                          14 
##         attacking_finishing  attacking_heading_accuracy 
##                          15                          16 
##     attacking_short_passing           attacking_volleys 
##                          17                          18 
##             skill_dribbling                 skill_curve 
##                          19                          20 
##           skill_fk_accuracy          skill_long_passing 
##                          21                          22 
##          skill_ball_control       movement_acceleration 
##                          23                          24 
##       movement_sprint_speed            movement_agility 
##                          25                          26 
##          movement_reactions            movement_balance 
##                          27                          28 
##            power_shot_power               power_jumping 
##                          29                          30 
##               power_stamina              power_strength 
##                          31                          32 
##            power_long_shots        mentality_aggression 
##                          33                          34 
##     mentality_interceptions       mentality_positioning 
##                          35                          36 
##            mentality_vision         mentality_penalties 
##                          37                          38 
##         mentality_composure defending_marking_awareness 
##                          39                          40 
##   defending_standing_tackle    defending_sliding_tackle 
##                          41                          42
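To see how many NAs each attribute has, rather than only which columns contain them, a minimal sketch along the following lines could be used. The toy data frame and its values are hypothetical stand-ins for players_22.

```r
# Toy data frame (hypothetical values) standing in for players_22
toy <- data.frame(
  age  = c(34, 32, NA, 29),
  pace = c(85, NA, NA, 91),
  name = c("A", "B", "C", "D")
)

# Number of NAs per column
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Share of rows (in %) that na.omit() would drop
na_share <- round(100 * mean(!complete.cases(toy)), 1)
print(na_share)
```

The share of incomplete rows is what matters when deciding between removal and imputation, since a row is dropped as soon as a single attribute is missing.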

Rows with NAs make up a share of the data that we judge we can afford to lose (1725 of 14918 rows, about 11.6%), so we remove them.

players_22 <- na.omit(players_22) # delete NA
dim(players_22)
## [1] 13193    42

We still have a good chunk of the dataset left. Since goalkeepers have special stats, we would also like to take them out. First, we check how many we have.

goalkeepers <- str_detect(players_22$player_positions, "GK")
sum(goalkeepers)
## [1] 0

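The count is 0: the goalkeepers were most likely already dropped by na.omit(), since the 1725 removed rows match the 1725 NAs in the six summary attributes (pace, shooting, passing, dribbling, defending, physic). While goalkeepers are indispensable on the field, we cannot say the same about their data, as it would reduce the accuracy of the classification of the other positions. We keep the filter below as a safeguard.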

players_22<-subset(players_22, player_positions!="GK")

2.3 Labelling

Some players play in multiple positions, but we only want to identify their main one, so we keep only the first position listed. Moreover, we turn the binary “preferred_foot” feature into a numerical one.

#Keep only the main preferred position
players_22$player_positions<- word(players_22$player_positions, 1, sep = fixed(","))
unique(players_22$player_positions)
##  [1] "RW"  "ST"  "LW"  "CM"  "CDM" "CF"  "LM"  "CB"  "CAM" "LB"  "RB"  "RM" 
## [13] "LWB" "RWB"
# Left foot is -1 and Right foot is 1. With only two categories, a single numeric column is enough
players_22$preferred_foot[players_22[,"preferred_foot"]== "Left"] <- as.numeric(-1)
players_22$preferred_foot[players_22[,"preferred_foot"]== "Right"] <- as.numeric(1)
players_22$preferred_foot <- as.numeric(players_22$preferred_foot)
# now we group them into the main 9 positions

Now, we take a look at the positions, and we plan to group them depending on the area of the field that they play in.

Goalkeeper excluded, there are 26 positions, namely:

  1. LWB = Left Wing Back
  2. LB = Left Back
  3. LCB = Left Center Back
  4. CB = Center Back
  5. RCB = Right Center Back
  6. RB = Right Back
  7. RWB = Right Wing Back
  8. LDM = Left Defensive Midfield
  9. CDM = Center Defensive Midfield
  10. RDM = Right Defensive Midfield
  11. RCM = Right Center Midfield
  12. CM = Center Midfield
  13. LCM = Left Center Midfield
  14. RAM = Right Attacking Midfield
  15. CAM = Center Attacking Midfield
  16. LAM = Left Attacking Midfield
  17. LM = Left Midfield
  18. RM = Right Midfield
  19. LW = Left Winger
  20. RW = Right Winger
  21. LF = Left Forward
  22. CF = Center Forward
  23. RF = Right Forward
  24. LS = Left Striker
  25. ST = Striker
  26. RS = Right Striker

As mentioned above, since 26 position labels are clearly too many, we group them into nine classes based on the area of action on the field.

Note: This is probably the only part where we applied our “domain knowledge”.

#central back
players_22$player_positions[players_22[,"player_positions"]== "LCB"|players_22[,"player_positions"]== "CB"|players_22[,"player_positions"]== "RCB"] <- "CB"

#left back
players_22$player_positions[players_22[,"player_positions"]== "LWB"|players_22[,"player_positions"]== "LB"]<-"LB"

#right back
players_22$player_positions[players_22[,"player_positions"]== "RWB"|players_22[,"player_positions"]== "RB"]<-"RB"

#central defensive midfielder
players_22$player_positions[players_22[,"player_positions"]== "LDM"|players_22[,"player_positions"]== "CDM"|players_22[,"player_positions"]== "RDM"] <- "CDM"

#central midfielder
players_22$player_positions[players_22[,"player_positions"]== "LCM"|players_22[,"player_positions"]== "CM"|players_22[,"player_positions"]== "RCM"] <- "CM"

#central attacking midfielder
players_22$player_positions[players_22[,"player_positions"]== "LAM"|players_22[,"player_positions"]== "CAM"|players_22[,"player_positions"]== "RAM"] <- "CAM"

#left winger
players_22$player_positions[players_22[,"player_positions"]== "LM"|players_22[,"player_positions"]== "LW"|players_22[,"player_positions"]== "LF"] <- "LW"

#right winger
players_22$player_positions[players_22[,"player_positions"]== "RM"|players_22[,"player_positions"]== "RW"|players_22[,"player_positions"]== "RF"] <- "RW"

#striker
players_22$player_positions[players_22[,"player_positions"]== "LS"|players_22[,"player_positions"]== "CF"|players_22[,"player_positions"]== "RS"] <- "ST"
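The same grouping can be sketched more compactly with a named lookup vector. This is an illustrative alternative, not the code used in this report; position_map and the sample positions vector are hypothetical names introduced here.

```r
# Named lookup vector: original position -> one of the nine grouped classes
position_map <- c(
  LCB = "CB",  CB  = "CB",  RCB = "CB",
  LWB = "LB",  LB  = "LB",
  RWB = "RB",  RB  = "RB",
  LDM = "CDM", CDM = "CDM", RDM = "CDM",
  LCM = "CM",  CM  = "CM",  RCM = "CM",
  LAM = "CAM", CAM = "CAM", RAM = "CAM",
  LM  = "LW",  LW  = "LW",  LF  = "LW",
  RM  = "RW",  RW  = "RW",  RF  = "RW",
  LS  = "ST",  CF  = "ST",  RS  = "ST", ST = "ST"
)

# Hypothetical sample of main positions
positions <- c("RW", "ST", "LM", "CF", "RWB", "CDM")

# Indexing the lookup vector by name maps every label in one step
grouped <- unname(position_map[positions])
print(grouped)
```

Keeping the mapping in a single vector makes it easier to check that every one of the 26 positions is assigned to exactly one class.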

Let's take a look at the distribution of our labels.

# named position_counts to avoid masking base::cat
position_counts <- table(factor(players_22$player_positions))
pie(position_counts,
    col = hcl.colors(length(position_counts), "BluYl"))

Time to normalize the numerical values, as promised. For that, we implement a simple re-scaling function, and we apply it on the whole dataframe.

# normalization function
normalize <-function(x) { (x -min(x))/(max(x)-min(x))   }

# normalize 
players_norm <- as.data.frame(lapply(players_22[, c(3:42)], normalize))
head(players_norm,5)
##         age height_cm weight_kg      pace  shooting   passing preferred_foot
## 1 0.4736842 0.3125000 0.4423077 0.8260870 0.9736842 0.9705882              0
## 2 0.4210526 0.6250000 0.6153846 0.7246377 0.9736842 0.7941176              1
## 3 0.5263158 0.6666667 0.6538462 0.8550725 1.0000000 0.8088235              1
## 4 0.3421053 0.4166667 0.3653846 0.9130435 0.8552632 0.8970588              1
## 5 0.3684211 0.5416667 0.4038462 0.6956522 0.8947368 1.0000000              1
##   weak_foot dribbling defending    physic attacking_crossing
## 1      0.75 1.0000000 0.2500000 0.5901639          0.8860759
## 2      0.75 0.8676471 0.3815789 0.8688525          0.7088608
## 3      0.75 0.8970588 0.2500000 0.7540984          0.9113924
## 4      1.00 0.9852941 0.2894737 0.5573770          0.8860759
## 5      1.00 0.8970588 0.6447368 0.8032787          1.0000000
##   attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1           1.0000000                  0.6973684               0.9577465
## 2           1.0000000                  0.9605263               0.8732394
## 3           1.0000000                  0.9605263               0.8028169
## 4           0.8588235                  0.6052632               0.8873239
## 5           0.8470588                  0.5000000               1.0000000
##   attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1            0.9750       1.0000000   0.9878049         1.0000000
## 2            0.9875       0.8589744   0.8170732         0.8928571
## 3            0.9500       0.8974359   0.8414634         0.8809524
## 4            0.9500       0.9871795   0.9268293         0.9166667
## 5            0.9000       0.8974359   0.8902439         0.8690476
##   skill_long_passing skill_ball_control movement_acceleration
## 1          0.9726027          1.0000000             0.9142857
## 2          0.6849315          0.8888889             0.7142857
## 3          0.7808219          0.8888889             0.8285714
## 4          0.8356164          0.9861111             0.9428571
## 5          1.0000000          0.9305556             0.7000000
##   movement_sprint_speed movement_agility movement_reactions movement_balance
## 1             0.7571429        0.9275362          1.0000000        0.9857143
## 2             0.7428571        0.7246377          0.9846154        0.8000000
## 3             0.8714286        0.8550725          1.0000000        0.6857143
## 4             0.8857143        1.0000000          0.9230769        0.8285714
## 5             0.7000000        0.7536232          0.9538462        0.7428571
##   power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1        0.8800000     0.5909091     0.6575342      0.6493506        1.0000000
## 2        0.9333333     0.8484848     0.7123288      0.8701299        0.9156627
## 3        0.9866667     1.0000000     0.7260274      0.7532468        0.9879518
## 4        0.8000000     0.5303030     0.7808219      0.4415584        0.8433735
## 5        0.9466667     0.5151515     0.8904110      0.7142857        0.9638554
##   mentality_aggression mentality_interceptions mentality_positioning
## 1            0.3200000               0.3703704             0.9642857
## 2            0.8133333               0.4814815             0.9880952
## 3            0.5733333               0.2345679             0.9880952
## 4            0.5733333               0.3333333             0.8809524
## 5            0.7466667               0.6913580             0.9047619
##   mentality_vision mentality_penalties mentality_composure
## 1        1.0000000              0.7750           1.0000000
## 2        0.8292683              0.9625           0.8787879
## 3        0.7682927              0.9375           0.9848485
## 4        0.9390244              1.0000           0.9545455
## 5        0.9878049              0.8750           0.8939394
##   defending_marking_awareness defending_standing_tackle
## 1                   0.1204819                 0.3012048
## 2                   0.3012048                 0.3855422
## 3                   0.1686747                 0.2650602
## 4                   0.3012048                 0.2650602
## 5                   0.6987952                 0.6626506
##   defending_sliding_tackle
## 1                0.1707317
## 2                0.1097561
## 3                0.1707317
## 4                0.2317073
## 5                0.5243902
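As a small worked example of the min-max rescaling above: a value x is mapped to (x - min) / (max - min), so the minimum becomes 0 and the maximum becomes 1. The sketch below also guards against a constant column, for which the original function would return NaN (0/0). The name normalize_safe and the choice to return zeros in that case are assumptions made for illustration.

```r
# Min-max rescaling with a guard for constant columns (illustrative variant)
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has max == min; return zeros instead of NaN (assumed convention)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical pace values: minimum, an intermediate value, maximum
pace <- c(28, 69, 97)
scaled <- normalize_safe(pace)
print(scaled)  # middle value is (69 - 28) / (97 - 28) = 41 / 69

constant <- normalize_safe(c(5, 5, 5))
print(constant)
```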

2.3 Correlation matrix and feature selection

We create a correlation matrix. It is big and maybe a bit hard to read, but R gives us the visually appealing option to group plotted features into highly correlated clusters.

cormatrix <- cor(players_norm)
corrplot(cormatrix, method = 'shade', type = 'lower', order = 'hclust', title = "Correlation plot before feature selection")

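Now, in order to reduce the number of features, we flag those involved in pairwise correlations above 0.8 in absolute value. For each such pair, findCorrelation marks the feature with the larger mean absolute correlation for removal.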

highcorr <- findCorrelation(cormatrix, cutoff=0.8)
highcorr
##  [1]  9  5  6 17 34 13 31 21 15 10 22 40 39 33 11 23
col2<-colnames(players_norm)
col2
##  [1] "age"                         "height_cm"                  
##  [3] "weight_kg"                   "pace"                       
##  [5] "shooting"                    "passing"                    
##  [7] "preferred_foot"              "weak_foot"                  
##  [9] "dribbling"                   "defending"                  
## [11] "physic"                      "attacking_crossing"         
## [13] "attacking_finishing"         "attacking_heading_accuracy" 
## [15] "attacking_short_passing"     "attacking_volleys"          
## [17] "skill_dribbling"             "skill_curve"                
## [19] "skill_fk_accuracy"           "skill_long_passing"         
## [21] "skill_ball_control"          "movement_acceleration"      
## [23] "movement_sprint_speed"       "movement_agility"           
## [25] "movement_reactions"          "movement_balance"           
## [27] "power_shot_power"            "power_jumping"              
## [29] "power_stamina"               "power_strength"             
## [31] "power_long_shots"            "mentality_aggression"       
## [33] "mentality_interceptions"     "mentality_positioning"      
## [35] "mentality_vision"            "mentality_penalties"        
## [37] "mentality_composure"         "defending_marking_awareness"
## [39] "defending_standing_tackle"   "defending_sliding_tackle"
col2<-col2[-highcorr]
corrplot.mixed(cor(players_norm[highcorr]), lower = "number", upper="shade", tl.pos = 'lt')
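To inspect which pairs of features actually exceed the cutoff, rather than only the indices of the flagged columns, something like the following base R sketch could help. It runs on simulated toy data (hypothetical columns a, b, c) and only lists the pairs; it does not reproduce the selection rule of findCorrelation.

```r
set.seed(1)
x <- rnorm(200)
# b is almost a copy of a; c is independent noise
toy <- data.frame(a = x, b = x + rnorm(200, sd = 0.1), c = rnorm(200))

# Absolute correlations, keeping only the lower triangle so each pair appears once
cm <- abs(cor(toy))
cm[upper.tri(cm, diag = TRUE)] <- 0

# Row/column positions of all pairs above the cutoff
pairs <- which(cm > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cm)[pairs[, "row"]],
  feature_2 = colnames(cm)[pairs[, "col"]],
  abs_cor   = cm[pairs]
)
print(high_pairs)
```

Applied to the real data, the same idea would use cormatrix in place of cor(toy).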

Now we check whether we eliminated some of the dark spots from our correlation matrix.

corrplot(cor(players_norm[col2]), type = 'lower',method = 'shade', order = 'hclust', title = "Correlation plot after feature selection")

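# NOTE: subset() without arguments returns every column, so players_model keeps all
# 40 normalized features; the reduced set would be players_norm[col2]
players_model <- subset(players_norm)
#we can add the positions back
players_model$player_positions <- c(players_22$player_positions)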

We did. Looks much better and ready for further investigation.

2.4 Individual feature investigation

We want to look at the individual distributions of the features. We draw boxplots and overlay violin plots on them, five features at a time.

# violin plots to check the distributions
# NOTE: par(mfrow) only affects base graphics, so it has no effect on the ggplot2 figures below
par(mfrow=c(4,2))
ggplot(data = melt(players_norm[,1:5]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,6:10]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

Weak foot is a discrete variable with values 1-5 (rescaled here to 0, 0.25, ..., 1). Preferred foot was encoded as +/-1, as discussed above (0/1 after rescaling). As in real life, right-footed players make up a significantly larger proportion.

ggplot(data = melt(players_norm[,11:15]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,16:20]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,21:25]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,26:30]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,31:35]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

ggplot(data = melt(players_norm[,36:40]), aes(y = variable, x = value, fill = variable, alpha = 0.7)) + geom_boxplot() + geom_violin() + scale_fill_manual(values = viridis(5)) + guides(fill = "none")
## Using  as id variables

2.5 Principal Component Analysis

players.pca<-prcomp(players_norm,center=TRUE, scale.=TRUE)
summary(players.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.8338 2.9386 2.0527 1.53692 1.08836 1.00962 0.90009
## Proportion of Variance 0.3674 0.2159 0.1053 0.05905 0.02961 0.02548 0.02025
## Cumulative Proportion  0.3674 0.5833 0.6887 0.74772 0.77733 0.80281 0.82307
##                            PC8     PC9    PC10    PC11    PC12   PC13    PC14
## Standard deviation     0.86204 0.78772 0.77920 0.70934 0.65328 0.6326 0.60847
## Proportion of Variance 0.01858 0.01551 0.01518 0.01258 0.01067 0.0100 0.00926
## Cumulative Proportion  0.84164 0.85716 0.87233 0.88491 0.89558 0.9056 0.91484
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.57770 0.54221 0.51429 0.49240 0.48928 0.4732 0.46759
## Proportion of Variance 0.00834 0.00735 0.00661 0.00606 0.00598 0.0056 0.00547
## Cumulative Proportion  0.92319 0.93054 0.93715 0.94321 0.94920 0.9548 0.96026
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.44503 0.43329 0.41573 0.39871 0.37169 0.35022 0.3462
## Proportion of Variance 0.00495 0.00469 0.00432 0.00397 0.00345 0.00307 0.0030
## Cumulative Proportion  0.96521 0.96990 0.97422 0.97820 0.98165 0.98472 0.9877
##                           PC29    PC30    PC31   PC32    PC33    PC34    PC35
## Standard deviation     0.32974 0.31684 0.31578 0.2825 0.26595 0.17110 0.02476
## Proportion of Variance 0.00272 0.00251 0.00249 0.0020 0.00177 0.00073 0.00002
## Cumulative Proportion  0.99043 0.99294 0.99544 0.9974 0.99920 0.99993 0.99995
##                           PC36    PC37    PC38    PC39    PC40
## Standard deviation     0.02424 0.02309 0.02131 0.01725 0.01537
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99996 0.99998 0.99999 0.99999 1.00000

We obtain 40 components. We want to visualise them.

fviz_eig(players.pca, addlabels = TRUE)

The first 5 components account for 77.7% of the total variance, while the first 2 account for 58.3% of it. Now we want to see how our features project onto the main 2D factor plane.

fviz_pca_var(players.pca, labelsize = 2, alpha.var = 1.0, title = "Factor Plane for the FIFA 22 Data")
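The proportions reported by summary() can be recomputed directly from the standard deviations stored in a prcomp object, which is handy for picking the number of components programmatically. The sketch below uses the built-in iris data as a stand-in for players_norm; the 80% threshold is an arbitrary illustrative choice.

```r
# PCA on the four numeric iris columns (stand-in for players_norm)
p <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Variance of each component is its squared standard deviation
var_explained <- p$sdev^2 / sum(p$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 4))

# Smallest number of components reaching the (arbitrary) 80% threshold
n_comp <- which(cum_var >= 0.80)[1]
print(n_comp)
```

With players.pca in place of p, cum_var[5] and cum_var[2] should correspond to the cumulative proportions quoted above.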

3. Modelling - Multiclass classification

Now it's finally time to dive into the actual modelling process. We experiment with and compare different classification algorithms.

3.1 Train-validation split

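We split the data into a training set and a validation set, following the classical 70%-30% approach. Note that no random seed is set in the code shown, so the exact split changes between runs.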

## 70% of the sample size
smp_size <- floor(0.7 * nrow(players_model))

train_ind <- sample(seq_len(nrow(players_model)), size = smp_size)

train <- players_model[train_ind, ]
test <- players_model[-train_ind, ]

print('Train set size:')
## [1] "Train set size:"
print(dim(train))
## [1] 9235   41
print('Validation set size:')
## [1] "Validation set size:"
print(dim(test))
## [1] 3958   41
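Since sample() draws at random, fixing a seed beforehand makes the split reproducible. A minimal sketch on a toy data frame follows; the seed value 123 and the names toy, train_toy and test_toy are hypothetical.

```r
set.seed(123)  # hypothetical seed; any fixed value makes the split repeatable

n <- 100
toy <- data.frame(id = seq_len(n), x = rnorm(n))

# 70% of the row indices go to training, the rest to validation
train_ind <- sample(seq_len(n), size = floor(0.7 * n))
train_toy <- toy[train_ind, ]
test_toy  <- toy[-train_ind, ]

print(dim(train_toy))
print(dim(test_toy))
```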

We factorise the labels, so we can use them in our models.

#factorise labels
train_y <- as.factor(train[,41])
test_y <- as.factor(test[,41])
#remove labels from sets
train <- train[1:(length(train)-1)]
test <- test[1:(length(test)-1)]

Just to take a sneak peek, this is how the validation labels are roughly distributed on the factor plane. We notice that the factor plane separates some of the classes quite well, and others not.

test.pca<-prcomp(test,center=TRUE, scale.=TRUE)
fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = test_y,
                legend.title = "Players",
                title = "Classification of players")

3.2 Useful functions

Before we train any model, we want to create a function that computes accuracy, and one that selects the misclassified data so we can visualize it later on the factor plane.

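# overall accuracy (%) from a confusion matrix: correct predictions lie on the diagonal
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}

# keep the predicted label for misclassified points and mark correct ones as "0"
missclassified <- function(pred, label){
  # compare as character: assigning 0 directly into a factor would produce NA
  l <- as.character(pred)
  l[l == as.character(label)] <- "0"
  return (as.factor(l))
}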
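As a small worked example of the accuracy computation: with six hypothetical players, four of which are predicted correctly, the diagonal of the confusion matrix sums to 4 and the accuracy is 4/6, about 66.7%. The vectors truth and pred are invented for illustration, and the accuracy function is written here in an equivalent, slightly simpler form.

```r
# Equivalent form of the accuracy helper: diagonal sum over total count, in percent
accuracy <- function(x) sum(diag(x)) / sum(x) * 100

# Hypothetical true and predicted labels; shared levels keep the table square
lv    <- c("CB", "CM", "ST")
truth <- factor(c("CB", "CB", "ST", "ST", "CM", "CM"), levels = lv)
pred  <- factor(c("CB", "CM", "ST", "ST", "CM", "CB"), levels = lv)

cm <- table(truth, pred)
print(cm)

acc <- accuracy(cm)
print(acc)

# Indices of the misclassified observations
wrong <- which(pred != truth)
print(wrong)
```

Giving both factors the same levels matters: if a class is never predicted, table() would otherwise return a non-square matrix and diag() would no longer line up with the correct predictions.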

3.3 Knn

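## run knn function
class <- factor(c(train_y))

# NOTE: the labels were already removed in section 3.1, so these two lines drop
# one more column (the last feature, defending_sliding_tackle); the results below
# therefore use the remaining 39 features
train <- train[1:(length(train)-1)]
test <- test[1:(length(test)-1)]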

accuracy_vect <- c()
ks<- c()

for(k1 in seq(5,100,5)) {
    test_pred <-knn(train = train, test = test, cl = class, k = k1)
    accuracy_vect <- append(accuracy_vect,accuracy(table(test_y,test_pred)))
    ks <- append(ks, k1)
}

plot(ks, accuracy_vect, type = "p", col="blue", xlab="k", ylab="Accuracy (%)", main="Accuracy vs k value plot")

We get the best k and its accuracy.

print('The best K in our case is:')
## [1] "The best K in our case is:"
print(ks[which.max(accuracy_vect)])
## [1] 25
print('And it gives us an accuracy of:' )
## [1] "And it gives us an accuracy of:"
print(accuracy_vect[which.max(accuracy_vect)])
## [1] 70.89439
# note: the final model is fitted with k = 40, not with the best k = 25 found above
test_pred <-knn(train = train, test = test, cl = class, k = 40)
df_pred=data.frame(test_y,test_pred)
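
The search above picks k by its accuracy on the validation set. An alternative that only touches the training data is leave-one-out cross-validation with knn.cv() from the class package. The sketch below is illustrative only: it uses the built-in iris data as a stand-in for the player attributes, and the grid and seed are assumptions.

```r
library(class)   # provides knn() and knn.cv()

# iris stands in for the player attributes here (illustrative data only)
x <- scale(iris[, 1:4])      # distances are scale-sensitive, so standardise
y <- iris$Species

set.seed(1)                  # knn breaks ties at random
k_grid <- seq(1, 15, 2)

# leave-one-out accuracy for each candidate k
cv_acc <- sapply(k_grid, function(k) mean(knn.cv(train = x, cl = y, k = k) == y))

best_k <- k_grid[which.max(cv_acc)]
best_k
```

The validation set then stays untouched until the final evaluation, at the cost of a slower search.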

We generate a confusion matrix to check the mislabeled data.

#Evaluate the model performance
CrossTable(x=test_y, y=test_pred,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | test_pred 
##       test_y |       CAM |        CB |       CDM |        CM |        LB |        LW |        RB |        RW |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CAM |       114 |         0 |         0 |        73 |         6 |        32 |         1 |        48 |        27 |       301 | 
##              |     0.379 |     0.000 |     0.000 |     0.243 |     0.020 |     0.106 |     0.003 |     0.159 |     0.090 |     0.076 | 
##              |     0.556 |     0.000 |     0.000 |     0.106 |     0.014 |     0.124 |     0.003 |     0.152 |     0.039 |           | 
##              |     0.029 |     0.000 |     0.000 |     0.018 |     0.002 |     0.008 |     0.000 |     0.012 |     0.007 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         0 |       646 |        26 |         4 |        29 |         0 |        20 |         0 |         0 |       725 | 
##              |     0.000 |     0.891 |     0.036 |     0.006 |     0.040 |     0.000 |     0.028 |     0.000 |     0.000 |     0.183 | 
##              |     0.000 |     0.909 |     0.088 |     0.006 |     0.065 |     0.000 |     0.057 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.163 |     0.007 |     0.001 |     0.007 |     0.000 |     0.005 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |         0 |        36 |       210 |       114 |        17 |         0 |        14 |         0 |         0 |       391 | 
##              |     0.000 |     0.092 |     0.537 |     0.292 |     0.043 |     0.000 |     0.036 |     0.000 |     0.000 |     0.099 | 
##              |     0.000 |     0.051 |     0.709 |     0.165 |     0.038 |     0.000 |     0.040 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.009 |     0.053 |     0.029 |     0.004 |     0.000 |     0.004 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |        24 |         4 |        40 |       385 |        22 |         1 |         0 |         5 |         0 |       481 | 
##              |     0.050 |     0.008 |     0.083 |     0.800 |     0.046 |     0.002 |     0.000 |     0.010 |     0.000 |     0.122 | 
##              |     0.117 |     0.006 |     0.135 |     0.558 |     0.050 |     0.004 |     0.000 |     0.016 |     0.000 |           | 
##              |     0.006 |     0.001 |     0.010 |     0.097 |     0.006 |     0.000 |     0.000 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |         0 |        11 |         1 |         9 |       331 |         0 |        15 |         1 |         0 |       368 | 
##              |     0.000 |     0.030 |     0.003 |     0.024 |     0.899 |     0.000 |     0.041 |     0.003 |     0.000 |     0.093 | 
##              |     0.000 |     0.015 |     0.003 |     0.013 |     0.747 |     0.000 |     0.043 |     0.003 |     0.000 |           | 
##              |     0.000 |     0.003 |     0.000 |     0.002 |     0.084 |     0.000 |     0.004 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LW |        30 |         0 |         2 |        22 |        29 |       115 |         1 |       100 |        52 |       351 | 
##              |     0.085 |     0.000 |     0.006 |     0.063 |     0.083 |     0.328 |     0.003 |     0.285 |     0.148 |     0.089 | 
##              |     0.146 |     0.000 |     0.007 |     0.032 |     0.065 |     0.446 |     0.003 |     0.317 |     0.076 |           | 
##              |     0.008 |     0.000 |     0.001 |     0.006 |     0.007 |     0.029 |     0.000 |     0.025 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |         0 |        14 |        16 |        31 |         2 |         0 |       287 |         1 |         0 |       351 | 
##              |     0.000 |     0.040 |     0.046 |     0.088 |     0.006 |     0.000 |     0.818 |     0.003 |     0.000 |     0.089 | 
##              |     0.000 |     0.020 |     0.054 |     0.045 |     0.005 |     0.000 |     0.815 |     0.003 |     0.000 |           | 
##              |     0.000 |     0.004 |     0.004 |     0.008 |     0.001 |     0.000 |     0.073 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RW |        28 |         0 |         0 |        42 |         7 |        82 |        14 |       127 |        46 |       346 | 
##              |     0.081 |     0.000 |     0.000 |     0.121 |     0.020 |     0.237 |     0.040 |     0.367 |     0.133 |     0.087 | 
##              |     0.137 |     0.000 |     0.000 |     0.061 |     0.016 |     0.318 |     0.040 |     0.403 |     0.067 |           | 
##              |     0.007 |     0.000 |     0.000 |     0.011 |     0.002 |     0.021 |     0.004 |     0.032 |     0.012 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |         9 |         0 |         1 |        10 |         0 |        28 |         0 |        33 |       563 |       644 | 
##              |     0.014 |     0.000 |     0.002 |     0.016 |     0.000 |     0.043 |     0.000 |     0.051 |     0.874 |     0.163 | 
##              |     0.044 |     0.000 |     0.003 |     0.014 |     0.000 |     0.109 |     0.000 |     0.105 |     0.818 |           | 
##              |     0.002 |     0.000 |     0.000 |     0.003 |     0.000 |     0.007 |     0.000 |     0.008 |     0.142 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       205 |       711 |       296 |       690 |       443 |       258 |       352 |       315 |       688 |      3958 | 
##              |     0.052 |     0.180 |     0.075 |     0.174 |     0.112 |     0.065 |     0.089 |     0.080 |     0.174 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
#creating confusion matrix
conf_mat <- confusion_matrix(targets = test_y,
                             predictions = test_pred)
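
Read on the diagonal, the "N / Row Total" and "N / Col Total" entries of the CrossTable output are the per-class recall and precision. For instance, the CAM row above has 114 of 301 players on the diagonal, hence 0.379. A minimal sketch with a made-up 2x2 table (the counts are hypothetical):

```r
# Hypothetical confusion table: rows = actual position, columns = predicted
tab <- matrix(c(8, 2,
                1, 9), nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("CB", "ST"), predicted = c("CB", "ST")))

recall    <- diag(tab) / rowSums(tab)   # share of each actual class that was found
precision <- diag(tab) / colSums(tab)   # share of each predicted class that is right

recall
precision
```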

Now we visualize the misclassified players on the factor plane.

fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(test_pred,test_y),
                legend.title = "Players",
                title = "Classification of labeled/misslabeled players for KNN")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2778 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

3.4 Random Forest

The hyperparameter we experiment with is mtry, the number of variables randomly sampled as candidates at each split. Changing the number of trees does not change much; from previous experimentation we found that around 500 trees works best.

set.seed(123)
a=c()
i=5
for (i in 5:10) {
  model_RF <- randomForest(train_y ~ ., data = train, ntree = 500, mtry = i, importance = TRUE)
  prediction_RF <- predict(model_RF, test, type = "class")
  a[i-4] = mean(prediction_RF == test_y) # nicer way to do accuracy than we did
}
plot(5:10,a)

a
## [1] 0.7304194 0.7314300 0.7306721 0.7337039 0.7299141 0.7304194

mtry = 8 is the best one, with an accuracy of about 73.4%.
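
The vector a is indexed from 1 while the loop runs over 5:10, so the position of the maximum has to be mapped back to the corresponding mtry value. A short sketch using the accuracies printed above (the names mtry_grid and acc are introduced here for illustration):

```r
mtry_grid <- 5:10
# accuracies copied from the output above
acc <- c(0.7304194, 0.7314300, 0.7306721, 0.7337039, 0.7299141, 0.7304194)

best_mtry <- mtry_grid[which.max(acc)]   # map the index back to the mtry value
spread    <- (max(acc) - min(acc)) * 100 # gap between best and worst, in points

best_mtry
spread
```

The spread between the best and the worst setting is below half a percentage point, so on this grid the choice of mtry matters little.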

We plot the misclassified labels again on the factor plane.

model_RF <- randomForest(train_y ~ ., data = train, ntree = 500, mtry = 8, importance = TRUE)
prediction_RF <- predict(model_RF, test, type = "class")
summary(model_RF)
##                 Length Class  Mode     
## call                6  -none- call     
## type                1  -none- character
## predicted        9235  factor numeric  
## err.rate         5000  -none- numeric  
## confusion          90  -none- numeric  
## votes           83115  matrix numeric  
## oob.times        9235  -none- numeric  
## classes             9  -none- character
## importance        429  -none- numeric  
## importanceSD      390  -none- numeric  
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y                9235  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(prediction_RF,test_y),
                legend.title = "Players",
                title = "Classification of labeled/misslabeled players for RF")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2902 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

We generate a confusion matrix to check the mislabeled data.

#Evaluate the model performance
CrossTable(x=test_y, y=prediction_RF,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | prediction_RF 
##       test_y |       CAM |        CB |       CDM |        CM |        LB |        LW |        RB |        RW |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CAM |       157 |         0 |         0 |        56 |         3 |        23 |         0 |        37 |        25 |       301 | 
##              |     0.522 |     0.000 |     0.000 |     0.186 |     0.010 |     0.076 |     0.000 |     0.123 |     0.083 |     0.076 | 
##              |     0.618 |     0.000 |     0.000 |     0.093 |     0.008 |     0.107 |     0.000 |     0.102 |     0.036 |           | 
##              |     0.040 |     0.000 |     0.000 |     0.014 |     0.001 |     0.006 |     0.000 |     0.009 |     0.006 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         0 |       665 |        24 |         0 |        20 |         0 |        16 |         0 |         0 |       725 | 
##              |     0.000 |     0.917 |     0.033 |     0.000 |     0.028 |     0.000 |     0.022 |     0.000 |     0.000 |     0.183 | 
##              |     0.000 |     0.890 |     0.071 |     0.000 |     0.054 |     0.000 |     0.043 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.168 |     0.006 |     0.000 |     0.005 |     0.000 |     0.004 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |         0 |        48 |       240 |        89 |         5 |         0 |         9 |         0 |         0 |       391 | 
##              |     0.000 |     0.123 |     0.614 |     0.228 |     0.013 |     0.000 |     0.023 |     0.000 |     0.000 |     0.099 | 
##              |     0.000 |     0.064 |     0.706 |     0.147 |     0.013 |     0.000 |     0.024 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.012 |     0.061 |     0.022 |     0.001 |     0.000 |     0.002 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |        23 |         0 |        58 |       386 |         5 |         2 |         3 |         4 |         0 |       481 | 
##              |     0.048 |     0.000 |     0.121 |     0.802 |     0.010 |     0.004 |     0.006 |     0.008 |     0.000 |     0.122 | 
##              |     0.091 |     0.000 |     0.171 |     0.639 |     0.013 |     0.009 |     0.008 |     0.011 |     0.000 |           | 
##              |     0.006 |     0.000 |     0.015 |     0.098 |     0.001 |     0.001 |     0.001 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |         0 |        17 |         2 |         9 |       318 |         1 |        18 |         3 |         0 |       368 | 
##              |     0.000 |     0.046 |     0.005 |     0.024 |     0.864 |     0.003 |     0.049 |     0.008 |     0.000 |     0.093 | 
##              |     0.000 |     0.023 |     0.006 |     0.015 |     0.853 |     0.005 |     0.049 |     0.008 |     0.000 |           | 
##              |     0.000 |     0.004 |     0.001 |     0.002 |     0.080 |     0.000 |     0.005 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LW |        26 |         0 |         2 |        19 |        20 |       100 |         1 |       138 |        45 |       351 | 
##              |     0.074 |     0.000 |     0.006 |     0.054 |     0.057 |     0.285 |     0.003 |     0.393 |     0.128 |     0.089 | 
##              |     0.102 |     0.000 |     0.006 |     0.031 |     0.054 |     0.467 |     0.003 |     0.380 |     0.065 |           | 
##              |     0.007 |     0.000 |     0.001 |     0.005 |     0.005 |     0.025 |     0.000 |     0.035 |     0.011 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |         0 |        17 |        14 |        13 |         2 |         0 |       302 |         3 |         0 |       351 | 
##              |     0.000 |     0.048 |     0.040 |     0.037 |     0.006 |     0.000 |     0.860 |     0.009 |     0.000 |     0.089 | 
##              |     0.000 |     0.023 |     0.041 |     0.022 |     0.005 |     0.000 |     0.818 |     0.008 |     0.000 |           | 
##              |     0.000 |     0.004 |     0.004 |     0.003 |     0.001 |     0.000 |     0.076 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RW |        37 |         0 |         0 |        24 |         0 |        70 |        19 |       153 |        43 |       346 | 
##              |     0.107 |     0.000 |     0.000 |     0.069 |     0.000 |     0.202 |     0.055 |     0.442 |     0.124 |     0.087 | 
##              |     0.146 |     0.000 |     0.000 |     0.040 |     0.000 |     0.327 |     0.051 |     0.421 |     0.062 |           | 
##              |     0.009 |     0.000 |     0.000 |     0.006 |     0.000 |     0.018 |     0.005 |     0.039 |     0.011 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |        11 |         0 |         0 |         8 |         0 |        18 |         1 |        25 |       581 |       644 | 
##              |     0.017 |     0.000 |     0.000 |     0.012 |     0.000 |     0.028 |     0.002 |     0.039 |     0.902 |     0.163 | 
##              |     0.043 |     0.000 |     0.000 |     0.013 |     0.000 |     0.084 |     0.003 |     0.069 |     0.837 |           | 
##              |     0.003 |     0.000 |     0.000 |     0.002 |     0.000 |     0.005 |     0.000 |     0.006 |     0.147 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       254 |       747 |       340 |       604 |       373 |       214 |       369 |       363 |       694 |      3958 | 
##              |     0.064 |     0.189 |     0.086 |     0.153 |     0.094 |     0.054 |     0.093 |     0.092 |     0.175 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

3.5 SVM

svm1 <- svm(formula= train_y~., data=train, 
          type="C-classification", kernel="radial", 
          gamma=0.1, cost=10)

We predict on the validation set, compute the accuracy and produce a summary of the model.

prediction_svm <- predict(svm1,test, type = "class")
accuracy(table(test_y, prediction_svm))
## [1] 71.52602
summary(svm1)
## 
## Call:
## svm(formula = train_y ~ ., data = train, type = "C-classification", 
##     kernel = "radial", gamma = 0.1, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  7205
## 
##  ( 979 588 874 809 694 792 1071 806 592 )
## 
## 
## Number of Classes:  9 
## 
## Levels: 
##  CAM CB CDM CM LB LW RB RW ST

We plot the mislabeled data.

fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(prediction_svm,test_y),
                legend.title = "Players",
                title = "Classification of labeled/misslabeled players for SVM")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 2831 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

We generate a confusion matrix to check the mislabeled data.

#Evaluate the model performance
CrossTable(x=test_y, y=prediction_svm,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | prediction_svm 
##       test_y |       CAM |        CB |       CDM |        CM |        LB |        LW |        RB |        RW |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CAM |       134 |         0 |         1 |        59 |         0 |        35 |         1 |        47 |        24 |       301 | 
##              |     0.445 |     0.000 |     0.003 |     0.196 |     0.000 |     0.116 |     0.003 |     0.156 |     0.080 |     0.076 | 
##              |     0.558 |     0.000 |     0.003 |     0.109 |     0.000 |     0.117 |     0.003 |     0.123 |     0.037 |           | 
##              |     0.034 |     0.000 |     0.000 |     0.015 |     0.000 |     0.009 |     0.000 |     0.012 |     0.006 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         0 |       660 |        26 |         3 |        19 |         0 |        17 |         0 |         0 |       725 | 
##              |     0.000 |     0.910 |     0.036 |     0.004 |     0.026 |     0.000 |     0.023 |     0.000 |     0.000 |     0.183 | 
##              |     0.000 |     0.862 |     0.073 |     0.006 |     0.051 |     0.000 |     0.048 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.167 |     0.007 |     0.001 |     0.005 |     0.000 |     0.004 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |         0 |        50 |       240 |        88 |         6 |         0 |         7 |         0 |         0 |       391 | 
##              |     0.000 |     0.128 |     0.614 |     0.225 |     0.015 |     0.000 |     0.018 |     0.000 |     0.000 |     0.099 | 
##              |     0.000 |     0.065 |     0.676 |     0.163 |     0.016 |     0.000 |     0.020 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.013 |     0.061 |     0.022 |     0.002 |     0.000 |     0.002 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |        37 |         4 |        75 |       349 |         3 |         6 |         0 |         5 |         2 |       481 | 
##              |     0.077 |     0.008 |     0.156 |     0.726 |     0.006 |     0.012 |     0.000 |     0.010 |     0.004 |     0.122 | 
##              |     0.154 |     0.005 |     0.211 |     0.647 |     0.008 |     0.020 |     0.000 |     0.013 |     0.003 |           | 
##              |     0.009 |     0.001 |     0.019 |     0.088 |     0.001 |     0.002 |     0.000 |     0.001 |     0.001 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |         0 |        23 |         2 |         6 |       312 |         5 |        19 |         1 |         0 |       368 | 
##              |     0.000 |     0.062 |     0.005 |     0.016 |     0.848 |     0.014 |     0.052 |     0.003 |     0.000 |     0.093 | 
##              |     0.000 |     0.030 |     0.006 |     0.011 |     0.846 |     0.017 |     0.054 |     0.003 |     0.000 |           | 
##              |     0.000 |     0.006 |     0.001 |     0.002 |     0.079 |     0.001 |     0.005 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LW |        31 |         1 |         2 |        11 |        19 |       121 |         2 |       126 |        38 |       351 | 
##              |     0.088 |     0.003 |     0.006 |     0.031 |     0.054 |     0.345 |     0.006 |     0.359 |     0.108 |     0.089 | 
##              |     0.129 |     0.001 |     0.006 |     0.020 |     0.051 |     0.406 |     0.006 |     0.330 |     0.058 |           | 
##              |     0.008 |     0.000 |     0.001 |     0.003 |     0.005 |     0.031 |     0.001 |     0.032 |     0.010 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |         0 |        27 |         9 |         5 |         9 |         3 |       293 |         5 |         0 |       351 | 
##              |     0.000 |     0.077 |     0.026 |     0.014 |     0.026 |     0.009 |     0.835 |     0.014 |     0.000 |     0.089 | 
##              |     0.000 |     0.035 |     0.025 |     0.009 |     0.024 |     0.010 |     0.828 |     0.013 |     0.000 |           | 
##              |     0.000 |     0.007 |     0.002 |     0.001 |     0.002 |     0.001 |     0.074 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RW |        28 |         0 |         0 |        12 |         1 |        94 |        14 |       164 |        33 |       346 | 
##              |     0.081 |     0.000 |     0.000 |     0.035 |     0.003 |     0.272 |     0.040 |     0.474 |     0.095 |     0.087 | 
##              |     0.117 |     0.000 |     0.000 |     0.022 |     0.003 |     0.315 |     0.040 |     0.429 |     0.050 |           | 
##              |     0.007 |     0.000 |     0.000 |     0.003 |     0.000 |     0.024 |     0.004 |     0.041 |     0.008 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |        10 |         1 |         0 |         6 |         0 |        34 |         1 |        34 |       558 |       644 | 
##              |     0.016 |     0.002 |     0.000 |     0.009 |     0.000 |     0.053 |     0.002 |     0.053 |     0.866 |     0.163 | 
##              |     0.042 |     0.001 |     0.000 |     0.011 |     0.000 |     0.114 |     0.003 |     0.089 |     0.852 |           | 
##              |     0.003 |     0.000 |     0.000 |     0.002 |     0.000 |     0.009 |     0.000 |     0.009 |     0.141 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       240 |       766 |       355 |       539 |       369 |       298 |       354 |       382 |       655 |      3958 | 
##              |     0.061 |     0.194 |     0.090 |     0.136 |     0.093 |     0.075 |     0.089 |     0.097 |     0.165 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

3.6 Label Grouping

The accuracies obtained (about 70-71% for KNN, 73% for Random Forest and 72% for SVM) are decent but not great, and the confusion matrices clearly explain why. Positions like CB, ST, LB and RB get classified really well. On the other hand, the most common confusions are CAM with CM, and RW with LW and vice versa.
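
As a cross-check, the accuracy of each fitted model can be recomputed from the diagonal of its confusion matrix. The counts below are the sums of the diagonal cells of the three tables above (they also match the number of rows removed in the plot warnings); the variable names are introduced here for illustration.

```r
n_valid <- 3958                                   # validation set size
correct <- c(KNN = 2778, RF = 2902, SVM = 2831)   # diagonal sums of the tables

acc_pct <- round(correct / n_valid * 100, 2)      # accuracy in percent
acc_pct
```

The KNN figure refers to the k = 40 model shown in the table, which is slightly below the best value found in the search over k.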

The first misclassification can be explained by the basic attributes of the role. The Central Attacking Midfielder shares a lot of attacking characteristics with the Winger, such as shooting and pace, but also many with the CM, like passing.

The second one is a bit trickier to explain. For Left Back and Right Back the preferred foot plays a big role, since it’s hard to find a right-footed player who plays on the left and vice versa, because they cross and tackle mostly with their dominant foot. For RW and LW the distinction based on the preferred foot is less clear-cut. On one hand, a lot of right-footed players like to play as Left Winger so they can cut inside and shoot with their strong foot; the same is true for left-footed players at RW. On the other hand, many Wingers prefer to cross, so they tend to do it with their preferred foot (LW with the left and RW with the right). So it’s really not an easy job for the model to detect these differences, which pertain to the individual player’s style of play, and this problem explains the drop in accuracy for these positions. In order to improve the accuracy of our classifiers, we group RW and LW together in a new position ‘W = Winger’ and merge CAM into CM.

test_y2 <- test_y
levels(test_y2)[levels(test_y2) == "RW"| levels(test_y2) == "LW"] <- "W"
levels(test_y2)[levels(test_y2) == "CAM"| levels(test_y2) == "CM"] <- "CM"


train_y2 <- train_y
levels(train_y2)[levels(train_y2) == "RW"| levels(train_y2) == "LW"] <- "W"
levels(train_y2)[levels(train_y2) == "CAM"| levels(train_y2) == "CM"] <- "CM"

unique(test_y2)
## [1] ST  W   CDM CM  RB  CB  LB 
## Levels: CM CB CDM LB W RB ST
#plot pie chart again
cat<- table(factor(test_y2))
pie(cat, col = hcl.colors(length(cat), "BluYl"))
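
The grouping relies on the fact that assigning the same name to several levels of a factor merges them into one. A minimal sketch on a toy factor (the vector pos is made up; %in% is used as a shorter equivalent of the two == comparisons joined with |):

```r
# Toy position labels (hypothetical)
pos <- factor(c("RW", "LW", "CAM", "CM", "ST", "LW"))
levels(pos)

grouped <- pos
# renaming several levels to the same name merges them
levels(grouped)[levels(grouped) %in% c("RW", "LW")]  <- "W"
levels(grouped)[levels(grouped) %in% c("CAM", "CM")] <- "CM"

levels(grouped)
table(grouped)
```

The merged level takes the position of the first level it replaces, which is why W appears between LB and RB in the level list printed above.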

This is the new distribution of labels. Now we repeat the same experiments, expecting a hefty increase in accuracy, at the price of coarser labels.

3.6.1 Knn

 prediction_knn2 <-knn(train = train, test = test, cl = train_y2, k = 20)
 CrossTable(x=test_y2, y=prediction_knn2,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | prediction_knn2 
##      test_y2 |        CM |        CB |       CDM |        LB |         W |        RB |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |       597 |         3 |        39 |        13 |       108 |         1 |        21 |       782 | 
##              |     0.763 |     0.004 |     0.050 |     0.017 |     0.138 |     0.001 |     0.027 |     0.198 | 
##              |     0.678 |     0.004 |     0.134 |     0.032 |     0.156 |     0.003 |     0.034 |           | 
##              |     0.151 |     0.001 |     0.010 |     0.003 |     0.027 |     0.000 |     0.005 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         2 |       651 |        28 |        24 |         0 |        20 |         0 |       725 | 
##              |     0.003 |     0.898 |     0.039 |     0.033 |     0.000 |     0.028 |     0.000 |     0.183 | 
##              |     0.002 |     0.905 |     0.096 |     0.059 |     0.000 |     0.058 |     0.000 |           | 
##              |     0.001 |     0.164 |     0.007 |     0.006 |     0.000 |     0.005 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |       121 |        36 |       205 |        14 |         0 |        15 |         0 |       391 | 
##              |     0.309 |     0.092 |     0.524 |     0.036 |     0.000 |     0.038 |     0.000 |     0.099 | 
##              |     0.137 |     0.050 |     0.702 |     0.034 |     0.000 |     0.043 |     0.000 |           | 
##              |     0.031 |     0.009 |     0.052 |     0.004 |     0.000 |     0.004 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |        13 |        12 |         1 |       325 |         1 |        16 |         0 |       368 | 
##              |     0.035 |     0.033 |     0.003 |     0.883 |     0.003 |     0.043 |     0.000 |     0.093 | 
##              |     0.015 |     0.017 |     0.003 |     0.800 |     0.001 |     0.046 |     0.000 |           | 
##              |     0.003 |     0.003 |     0.000 |     0.082 |     0.000 |     0.004 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            W |        99 |         0 |         2 |        28 |       493 |        14 |        61 |       697 | 
##              |     0.142 |     0.000 |     0.003 |     0.040 |     0.707 |     0.020 |     0.088 |     0.176 | 
##              |     0.112 |     0.000 |     0.007 |     0.069 |     0.712 |     0.040 |     0.098 |           | 
##              |     0.025 |     0.000 |     0.001 |     0.007 |     0.125 |     0.004 |     0.015 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |        31 |        17 |        16 |         2 |         5 |       280 |         0 |       351 | 
##              |     0.088 |     0.048 |     0.046 |     0.006 |     0.014 |     0.798 |     0.000 |     0.089 | 
##              |     0.035 |     0.024 |     0.055 |     0.005 |     0.007 |     0.809 |     0.000 |           | 
##              |     0.008 |     0.004 |     0.004 |     0.001 |     0.001 |     0.071 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |        18 |         0 |         1 |         0 |        85 |         0 |       540 |       644 | 
##              |     0.028 |     0.000 |     0.002 |     0.000 |     0.132 |     0.000 |     0.839 |     0.163 | 
##              |     0.020 |     0.000 |     0.003 |     0.000 |     0.123 |     0.000 |     0.868 |           | 
##              |     0.005 |     0.000 |     0.000 |     0.000 |     0.021 |     0.000 |     0.136 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       881 |       719 |       292 |       406 |       692 |       346 |       622 |      3958 | 
##              |     0.223 |     0.182 |     0.074 |     0.103 |     0.175 |     0.087 |     0.157 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

With the merged classes, the confusion matrix looks much better.

accuracy(table(prediction_knn2, test_y2))
## [1] 78.095
fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(prediction_knn2, test_y2),
                legend.title = "Players",
                title = "Classification of labeled/mislabeled players for KNN2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3091 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
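
The warnings above suggest that the colouring helper assigns the value 0 into a factor, so correctly classified players become NA and are dropped from the plot (3091 of the 3958 test players, which matches the 78.095% accuracy). A minimal sketch of an alternative helper that avoids the NA values is shown below. The name `flag_misclassified` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: returns a two-level factor instead of writing 0 into a factor,
# so no NA values are generated and every player keeps a colour in the plot.
flag_misclassified <- function(pred, label) {
  # Compare as character so that differing factor level sets do not matter
  correct <- as.character(pred) == as.character(label)
  factor(ifelse(correct, "correct", "misclassified"),
         levels = c("correct", "misclassified"))
}

# Toy data (illustrative only)
positions <- c("CM", "CB", "W", "ST")
pred  <- factor(c("CM", "W",  "ST", "CB"), levels = positions)
label <- factor(c("CM", "ST", "ST", "CB"), levels = positions)

flags <- flag_misclassified(pred, label)
print(table(flags))
```

A factor like this could be passed as `col.ind` in the same way as the output of the original helper, assuming the plotting call is otherwise unchanged.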

3.6.2 Random Forest

model_RF2 <- randomForest(train_y2 ~ ., data = train, ntree = 500, mtry = 8, importance = TRUE)
prediction_RF2 <- predict(model_RF2, test, type = "class")
summary(model_RF2)
##                 Length Class  Mode     
## call                6  -none- call     
## type                1  -none- character
## predicted        9235  factor numeric  
## err.rate         4000  -none- numeric  
## confusion          56  -none- numeric  
## votes           64645  matrix numeric  
## oob.times        9235  -none- numeric  
## classes             7  -none- character
## importance        351  -none- numeric  
## importanceSD      312  -none- numeric  
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y                9235  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
accuracy(table(prediction_RF2, test_y2))
## [1] 80.29308
fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(prediction_RF2,test_y2),
                legend.title = "Players",
                title = "Classification of labeled/mislabeled players for RF2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3178 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

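We generate a confusion matrix to check the mislabeled data. The rows use the original nine positions (test_y), so the table also shows how CAM, LW and RW players are distributed over the merged classes.

# Evaluate the model performance
CrossTable(x = test_y, y = prediction_RF2, prop.chisq = FALSE)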
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | prediction_RF2 
##       test_y |        CM |        CB |       CDM |        LB |         W |        RB |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CAM |       183 |         0 |         0 |         0 |        98 |         0 |        20 |       301 | 
##              |     0.608 |     0.000 |     0.000 |     0.000 |     0.326 |     0.000 |     0.066 |     0.076 | 
##              |     0.223 |     0.000 |     0.000 |     0.000 |     0.138 |     0.000 |     0.032 |           | 
##              |     0.046 |     0.000 |     0.000 |     0.000 |     0.025 |     0.000 |     0.005 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         1 |       662 |        24 |        21 |         0 |        17 |         0 |       725 | 
##              |     0.001 |     0.913 |     0.033 |     0.029 |     0.000 |     0.023 |     0.000 |     0.183 | 
##              |     0.001 |     0.886 |     0.072 |     0.058 |     0.000 |     0.047 |     0.000 |           | 
##              |     0.000 |     0.167 |     0.006 |     0.005 |     0.000 |     0.004 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |        93 |        48 |       234 |         5 |         1 |        10 |         0 |       391 | 
##              |     0.238 |     0.123 |     0.598 |     0.013 |     0.003 |     0.026 |     0.000 |     0.099 | 
##              |     0.114 |     0.064 |     0.705 |     0.014 |     0.001 |     0.028 |     0.000 |           | 
##              |     0.023 |     0.012 |     0.059 |     0.001 |     0.000 |     0.003 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |       414 |         0 |        54 |         4 |         6 |         3 |         0 |       481 | 
##              |     0.861 |     0.000 |     0.112 |     0.008 |     0.012 |     0.006 |     0.000 |     0.122 | 
##              |     0.505 |     0.000 |     0.163 |     0.011 |     0.008 |     0.008 |     0.000 |           | 
##              |     0.105 |     0.000 |     0.014 |     0.001 |     0.002 |     0.001 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |        10 |        19 |         3 |       312 |         6 |        18 |         0 |       368 | 
##              |     0.027 |     0.052 |     0.008 |     0.848 |     0.016 |     0.049 |     0.000 |     0.093 | 
##              |     0.012 |     0.025 |     0.009 |     0.864 |     0.008 |     0.050 |     0.000 |           | 
##              |     0.003 |     0.005 |     0.001 |     0.079 |     0.002 |     0.005 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LW |        37 |         0 |         2 |        16 |       269 |         1 |        26 |       351 | 
##              |     0.105 |     0.000 |     0.006 |     0.046 |     0.766 |     0.003 |     0.074 |     0.089 | 
##              |     0.045 |     0.000 |     0.006 |     0.044 |     0.378 |     0.003 |     0.042 |           | 
##              |     0.009 |     0.000 |     0.001 |     0.004 |     0.068 |     0.000 |     0.007 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |        14 |        18 |        15 |         2 |         4 |       298 |         0 |       351 | 
##              |     0.040 |     0.051 |     0.043 |     0.006 |     0.011 |     0.849 |     0.000 |     0.089 | 
##              |     0.017 |     0.024 |     0.045 |     0.006 |     0.006 |     0.821 |     0.000 |           | 
##              |     0.004 |     0.005 |     0.004 |     0.001 |     0.001 |     0.075 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RW |        51 |         0 |         0 |         1 |       253 |        15 |        26 |       346 | 
##              |     0.147 |     0.000 |     0.000 |     0.003 |     0.731 |     0.043 |     0.075 |     0.087 | 
##              |     0.062 |     0.000 |     0.000 |     0.003 |     0.356 |     0.041 |     0.042 |           | 
##              |     0.013 |     0.000 |     0.000 |     0.000 |     0.064 |     0.004 |     0.007 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |        16 |         0 |         0 |         0 |        74 |         1 |       553 |       644 | 
##              |     0.025 |     0.000 |     0.000 |     0.000 |     0.115 |     0.002 |     0.859 |     0.163 | 
##              |     0.020 |     0.000 |     0.000 |     0.000 |     0.104 |     0.003 |     0.885 |           | 
##              |     0.004 |     0.000 |     0.000 |     0.000 |     0.019 |     0.000 |     0.140 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       819 |       747 |       332 |       361 |       711 |       363 |       625 |      3958 | 
##              |     0.207 |     0.189 |     0.084 |     0.091 |     0.180 |     0.092 |     0.158 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
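
As a sanity check, the reported accuracy can be recovered from the table above by merging the rows of the original positions into the seven predicted classes (CAM into CM, LW and RW into W). The following is a minimal base-R sketch with the counts copied by hand from the table. The names `counts`, `merge_map` and `merged` are ours and do not appear in the original analysis.

```r
# Counts copied from the CrossTable output above (rows: test_y, columns: prediction_RF2)
counts <- matrix(c(
  183,   0,   0,   0,  98,   0,  20,   # CAM
    1, 662,  24,  21,   0,  17,   0,   # CB
   93,  48, 234,   5,   1,  10,   0,   # CDM
  414,   0,  54,   4,   6,   3,   0,   # CM
   10,  19,   3, 312,   6,  18,   0,   # LB
   37,   0,   2,  16, 269,   1,  26,   # LW
   14,  18,  15,   2,   4, 298,   0,   # RB
   51,   0,   0,   1, 253,  15,  26,   # RW
   16,   0,   0,   0,  74,   1, 553    # ST
), nrow = 9, byrow = TRUE,
dimnames = list(actual    = c("CAM", "CB", "CDM", "CM", "LB", "LW", "RB", "RW", "ST"),
                predicted = c("CM", "CB", "CDM", "LB", "W", "RB", "ST")))

# Assumed mapping from the nine original positions to the seven merged classes
merge_map <- c(CAM = "CM", CB = "CB", CDM = "CDM", CM = "CM", LB = "LB",
               LW = "W", RB = "RB", RW = "W", ST = "ST")

# Sum the rows that belong to the same merged class, then align rows with columns
merged <- rowsum(counts, group = merge_map[rownames(counts)])
merged <- merged[colnames(counts), ]

# Accuracy in percent: correctly classified players over all test players
accuracy_pct <- 100 * sum(diag(merged)) / sum(merged)
print(accuracy_pct)
```

The diagonal of the merged table sums to 3178 correct predictions out of 3958, i.e. about 80.29%, in line with the value returned by accuracy() above.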

3.6.3 SVM

svm2 <- svm(formula = train_y2 ~ ., data = train,
            type = "C-classification", kernel = "radial",
            gamma = 0.1, cost = 10)
prediction_svm2 <- predict(svm2, test, type = "class")
summary(svm2)
## 
## Call:
## svm(formula = train_y2 ~ ., data = train, type = "C-classification", 
##     kernel = "radial", gamma = 0.1, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  6667
## 
##  ( 978 585 1339 808 690 1487 780 )
## 
## 
## Number of Classes:  7 
## 
## Levels: 
##  CM CB CDM LB W RB ST
accuracy(table(test_y2, prediction_svm2))
## [1] 79.66145
fviz_pca_biplot(test.pca,
                label = "all",
                col.ind = missclassified(prediction_svm2, test_y2),
                legend.title = "Players",
                title = "Classification of labeled/mislabeled players for SVM2")
## Warning in `[<-.factor`(`*tmp*`, c(pred) == c(label), value = 0): invalid factor
## level, NA generated
## Warning: Removed 3153 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

We generate a confusion matrix to check the mislabeled data. As before, the rows use the original nine positions (test_y).

# Evaluate the model performance
CrossTable(x = test_y, y = prediction_svm2, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3958 
## 
##  
##              | prediction_svm2 
##       test_y |        CM |        CB |       CDM |        LB |         W |        RB |        ST | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CAM |       199 |         0 |         1 |         0 |        84 |         1 |        16 |       301 | 
##              |     0.661 |     0.000 |     0.003 |     0.000 |     0.279 |     0.003 |     0.053 |     0.076 | 
##              |     0.248 |     0.000 |     0.003 |     0.000 |     0.116 |     0.003 |     0.026 |           | 
##              |     0.050 |     0.000 |     0.000 |     0.000 |     0.021 |     0.000 |     0.004 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CB |         4 |       659 |        26 |        19 |         0 |        17 |         0 |       725 | 
##              |     0.006 |     0.909 |     0.036 |     0.026 |     0.000 |     0.023 |     0.000 |     0.183 | 
##              |     0.005 |     0.866 |     0.077 |     0.052 |     0.000 |     0.049 |     0.000 |           | 
##              |     0.001 |     0.166 |     0.007 |     0.005 |     0.000 |     0.004 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          CDM |        97 |        50 |       231 |         5 |         1 |         7 |         0 |       391 | 
##              |     0.248 |     0.128 |     0.591 |     0.013 |     0.003 |     0.018 |     0.000 |     0.099 | 
##              |     0.121 |     0.066 |     0.681 |     0.014 |     0.001 |     0.020 |     0.000 |           | 
##              |     0.025 |     0.013 |     0.058 |     0.001 |     0.000 |     0.002 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           CM |       394 |         3 |        69 |         3 |        11 |         0 |         1 |       481 | 
##              |     0.819 |     0.006 |     0.143 |     0.006 |     0.023 |     0.000 |     0.002 |     0.122 | 
##              |     0.491 |     0.004 |     0.204 |     0.008 |     0.015 |     0.000 |     0.002 |           | 
##              |     0.100 |     0.001 |     0.017 |     0.001 |     0.003 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LB |         6 |        23 |         2 |       311 |         9 |        17 |         0 |       368 | 
##              |     0.016 |     0.062 |     0.005 |     0.845 |     0.024 |     0.046 |     0.000 |     0.093 | 
##              |     0.007 |     0.030 |     0.006 |     0.852 |     0.012 |     0.049 |     0.000 |           | 
##              |     0.002 |     0.006 |     0.001 |     0.079 |     0.002 |     0.004 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           LW |        40 |         0 |         2 |        17 |       259 |         1 |        32 |       351 | 
##              |     0.114 |     0.000 |     0.006 |     0.048 |     0.738 |     0.003 |     0.091 |     0.089 | 
##              |     0.050 |     0.000 |     0.006 |     0.047 |     0.357 |     0.003 |     0.052 |           | 
##              |     0.010 |     0.000 |     0.001 |     0.004 |     0.065 |     0.000 |     0.008 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RB |         6 |        25 |         8 |        10 |        11 |       291 |         0 |       351 | 
##              |     0.017 |     0.071 |     0.023 |     0.028 |     0.031 |     0.829 |     0.000 |     0.089 | 
##              |     0.007 |     0.033 |     0.024 |     0.027 |     0.015 |     0.834 |     0.000 |           | 
##              |     0.002 |     0.006 |     0.002 |     0.003 |     0.003 |     0.074 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           RW |        37 |         0 |         0 |         0 |       268 |        14 |        27 |       346 | 
##              |     0.107 |     0.000 |     0.000 |     0.000 |     0.775 |     0.040 |     0.078 |     0.087 | 
##              |     0.046 |     0.000 |     0.000 |     0.000 |     0.370 |     0.040 |     0.044 |           | 
##              |     0.009 |     0.000 |     0.000 |     0.000 |     0.068 |     0.004 |     0.007 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           ST |        19 |         1 |         0 |         0 |        82 |         1 |       541 |       644 | 
##              |     0.030 |     0.002 |     0.000 |     0.000 |     0.127 |     0.002 |     0.840 |     0.163 | 
##              |     0.024 |     0.001 |     0.000 |     0.000 |     0.113 |     0.003 |     0.877 |           | 
##              |     0.005 |     0.000 |     0.000 |     0.000 |     0.021 |     0.000 |     0.137 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       802 |       761 |       339 |       365 |       725 |       349 |       617 |      3958 | 
##              |     0.203 |     0.192 |     0.086 |     0.092 |     0.183 |     0.088 |     0.156 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
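
To see which classes drive the remaining errors, per-class recall and precision can be derived from the table above. The sketch below copies the counts by hand and merges the original positions into the seven predicted classes. The names `counts_svm`, `merge_map` and `merged_svm` are ours and are not part of the original analysis.

```r
# Counts copied from the CrossTable output above (rows: test_y, columns: prediction_svm2)
counts_svm <- matrix(c(
  199,   0,   1,   0,  84,   1,  16,   # CAM
    4, 659,  26,  19,   0,  17,   0,   # CB
   97,  50, 231,   5,   1,   7,   0,   # CDM
  394,   3,  69,   3,  11,   0,   1,   # CM
    6,  23,   2, 311,   9,  17,   0,   # LB
   40,   0,   2,  17, 259,   1,  32,   # LW
    6,  25,   8,  10,  11, 291,   0,   # RB
   37,   0,   0,   0, 268,  14,  27,   # RW
   19,   1,   0,   0,  82,   1, 541    # ST
), nrow = 9, byrow = TRUE,
dimnames = list(actual    = c("CAM", "CB", "CDM", "CM", "LB", "LW", "RB", "RW", "ST"),
                predicted = c("CM", "CB", "CDM", "LB", "W", "RB", "ST")))

# Assumed mapping from the nine original positions to the seven merged classes
merge_map <- c(CAM = "CM", CB = "CB", CDM = "CDM", CM = "CM", LB = "LB",
               LW = "W", RB = "RB", RW = "W", ST = "ST")

# Collapse rows to the merged classes and align the row order with the columns
merged_svm <- rowsum(counts_svm, group = merge_map[rownames(counts_svm)])
merged_svm <- merged_svm[colnames(counts_svm), ]

# Recall: share of each true class that is predicted correctly
recall <- diag(merged_svm) / rowSums(merged_svm)
# Precision: share of each predicted class that is actually correct
precision <- diag(merged_svm) / colSums(merged_svm)

print(round(cbind(recall, precision), 3))
```

Under this merging, CDM has the lowest recall (about 0.59), mostly because CDM players are predicted as CM or CB, while CB is recovered in roughly 91% of cases.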

4. Conclusion and further research

All in all, position classification works well for clearly distinct areas of the football field, but in the multiclass setting it is close to impossible for some specific positions. We tried dedicated models for RW vs. LW and for CM vs. CAM, but the results were not far from random. This is because many footballers have the attributes needed to play equally well in several positions. To improve classification, a multilabel approach over all player positions would be better.
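
As an illustration of the multilabel idea, each player could be given one binary indicator per position instead of a single class. The sketch below assumes that every player comes with a comma-separated string of positions. The vector `player_positions` and its toy values are hypothetical and only serve to show the encoding step.

```r
# Hypothetical input: one comma-separated string of positions per player
player_positions <- c("RW, ST", "CAM, CM", "CB", "LB, LW")

# Split each string into its individual positions
position_list <- strsplit(player_positions, ",\\s*")

# All distinct positions that occur in the data
all_positions <- sort(unique(unlist(position_list)))

# One row per player, one 0/1 column per position
label_matrix <- t(vapply(
  position_list,
  function(p) as.integer(all_positions %in% p),
  integer(length(all_positions))
))
colnames(label_matrix) <- all_positions

print(label_matrix)
```

Each column of such a matrix could then be treated as its own binary target, so that a player who fits both RW and ST is no longer forced into a single class.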

On the one hand, football is a very heterogeneous sport, and attribute values often cannot fully explain a player's position, since his style of play heavily influences how the role is interpreted and, consequently, where exactly the player acts on the field. On the other hand, we would like to believe that, with sufficient data, even the effective positioning of real players could be estimated.